Re: [PATCH 0/13] maps: pagemap, kpagemap, and related cleanups

2007-04-16 Thread Frank Ch. Eigler
Hi -

On Mon, Apr 16, 2007 at 11:36:05PM +0200, Andi Kleen wrote:
> Christoph Hellwig <[EMAIL PROTECTED]> writes:
> > and [systemtap] does a lot of really wrong things in its
> > runtime). [...]

(Thanks, Christoph, for at least a few specifics.  Some of them have
already been dealt with in the recent past.)


> I must agree with that. Perhaps it would be good if its runtime code
> was posted to l-k at some point and reviewed in the standard way
> even when it isn't merged.

I'll let the runtime's maintainers judge whether this particular venue
would be helpful.  But is the choice of venue really an obstacle?
Everyone who cares is *already* welcome to browse the code (available
on CVS, cvsweb, tarballs - would git help?), and critique it (e.g. on
our open public mailing list or via bugzilla).

http://sourceware.org/systemtap/getinvolved.html

- FChE
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 0/13] maps: pagemap, kpagemap, and related cleanups

2007-04-16 Thread Andi Kleen
Christoph Hellwig <[EMAIL PROTECTED]> writes:

> and does a lot of really wrong things in its runtime).

I must agree with that. Perhaps it would be good if its runtime code
was posted to l-k at some point and reviewed in the standard way
even when it isn't merged.

-Andi


Re: [PATCH 0/13] maps: pagemap, kpagemap, and related cleanups

2007-04-16 Thread Christoph Hellwig
On Fri, Apr 13, 2007 at 10:08:27AM -0400, Theodore Tso wrote:
> On Fri, Apr 13, 2007 at 01:40:08PM +1000, Nick Piggin wrote:
> > With systemtap scripts, you could walk pagetables and print *the exact
> > page information you want*, or you could walk pfns, or LRU, or page_tree,
> > or walk the page tree then the rmap structures. And you can selectively
> > cull out items you don't care about if you only care about a subset of
> > items, based on arbitrary criteria. And you can most likely do all that
> > more efficiently than with a conglomeration of various /proc files
> > (assuming they even provide what you want in the first place).
> 
> Yes, but maintaining the systemtap scripts will be a nightmare, since
> they would be outside the kernel, and as we change our internal data
> structure, the scripts would become useless.
> 
> This is a fundamental problem with systemtap that we haven't been able
> to solve yet, because solving it would freeze various internal data
> structures or kernel functions.  I agree that's not acceptable; which
> is why I don't think systemtap would be a good match for the problem
> we're trying to solve here.

It's also fundamentally not solvable.  Even Sun doesn't guarantee
that dtrace scripts are portable, because that would simply mean you'd
have to freeze all internals.  Of course systemtap management with their
executive visibility and plain stupidity of copying whatever Sun does will
never ever get it.  This whole mess will only be solvable if IBM fires
the right people in management.


Re: [PATCH 0/13] maps: pagemap, kpagemap, and related cleanups

2007-04-16 Thread Christoph Hellwig
On Fri, Apr 13, 2007 at 05:17:00PM -0400, Frank Ch. Eigler wrote:
> It may be worthwhile to remind people that it is easy to use systemtap
> only to the extent of automating the placement of kprobes: just to
> perform the function-name/source-file/line-number triplet to PC
> mapping.  They can use embedded-C code to do all the same stuff they'd
> do with kprobes.  They are not obligated to write any odd script code
> for probing logic, nor indeed use any of this really wrong runtime.

Umm, yes- as long as you write systemtap the runtime gets linked in
currently.  That doesn't mean you actually use a lot of it in the end,
but the maintenance horror of actually getting all the junk code to
compile still is there.

Now the actual function-name/source-file/line-number triplet to PC is
really useful functionality, and for my tracing work I could really
use this a lot.  Unfortunately systemtap doesn't have a proper layered
approach and you can't use this bit without pulling in all the
junk.  I've started some DWARF-based function-name/source-file/line-number
mapping of my own based on acme's work, but it's stalled due to more important
issues going on.

> > We could not really distribute systemtap scripts with the kernel.
> > systemtap is a bloody complicated piece of [software] 
> 
> I don't know if that should be treated as a compliment to our team, for
> being able to work quickly on something that a full-grown kernel
> developer finds bloody complicated.  Perhaps your information is
> simply outdated.  Big & bloated?  We have several times asked for
> specifics rather than smears - what about it?

There's a lot of stuff unneeded for basic tasks.  But if you want
a detailed review you could submit your runtime bits for review and
get feedback from everyone.

> > outside the kernel tree that breaks all the time we change kernel
> > internals. [...]
> 
> That's begging the question.  If kernel folks are willing to maintain
> some included systemtap-related code, then by definition it would not
> break all the time.

We'll definitely need a trace transport.  I currently use a hackish
kfifo ringbuffer derived from net/ipv4/tcp_probe.c, but it's showing
its limitations.  Tom promised long ago to factor out the trace
code from blktrace into generic bits, but as he hasn't delivered
I suspect I'll have to do that myself soon.
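
For readers unfamiliar with the idea, a trace transport of this kind can be
sketched as a fixed-size ring of timestamped records.  The following is only
an illustration in plain userspace C, not the kfifo-based code mentioned
above: the names trace_rec, trace_ring, trace_put and trace_get and the slot
count are all made up for the example, and it does no locking and uses no
memory barriers, which a real in-kernel transport would need.

```c
#include <stddef.h>
#include <stdint.h>

#define TRACE_SLOTS 8u  /* hypothetical size; must be a power of two */

/* One trace record: when a probe fired and which probe it was. */
struct trace_rec {
    uint64_t timestamp;
    uint32_t probe_id;
};

/* head and tail count records written and read; they may wrap freely. */
struct trace_ring {
    struct trace_rec slot[TRACE_SLOTS];
    unsigned head;
    unsigned tail;
};

/* Append a record.  Returns 0, or -1 if the ring is full (record dropped). */
int trace_put(struct trace_ring *r, struct trace_rec rec)
{
    if (r->head - r->tail == TRACE_SLOTS)
        return -1;
    r->slot[r->head & (TRACE_SLOTS - 1)] = rec;
    r->head++;
    return 0;
}

/* Remove the oldest record into *out.  Returns 0, or -1 if the ring is empty. */
int trace_get(struct trace_ring *r, struct trace_rec *out)
{
    if (r->head == r->tail)
        return -1;
    *out = r->slot[r->tail & (TRACE_SLOTS - 1)];
    r->tail++;
    return 0;
}
```

This sketch drops new records when the ring is full; overwriting the oldest
record instead is an equally reasonable policy for tracing.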

The safe dereference bits are a bit questionable, but at least worth
a try to put into the tree proper, because there's no chance they'd
be properly maintained outside.

The register dumps you do could definitely stand some integration
with the register dumps in panic messages, and would be useful library
functions for proper C language kprobes, but that means disentangling
the core from the utterly horrible systemtap pascal string handling.

Stack backtrace handling could use some integration with the stack
tracing framework used for lockdep and fault injection and be available
more generically for C kprobes.

With a proper tracing infrastructure we'll need the timing bits for
it as well, which should supersede the utter mess in systemtap in
that area (I'm hoping for Matthew to come up with something there
as part of lttng).


Re: [PATCH 0/13] maps: pagemap, kpagemap, and related cleanups

2007-04-14 Thread Maneesh Soni
On Thu, Apr 12, 2007 at 09:23:45PM -0500, Matt Mackall wrote:
> On Fri, Apr 13, 2007 at 12:21:25PM +1000, Nick Piggin wrote:
> > Matt Mackall wrote:
> > >On Fri, Apr 13, 2007 at 11:42:29AM +1000, Nick Piggin wrote:
> > 
> > >>If kprobes is simply crappy and doesn't work properly for this, then I
> > >>could accept that. I'm not someone trying to get this info. So why can't
> > >>it be used? (not just for kpagemap, but for clear_refs and all that gunk
> > >>too).
> > >
> > >
> > >kprobes is good for looking at events, but bad for looking at state.
> > >Especially metric shitloads of state.
> > 
> > Why? Why is a kprobes trap significantly more expensive than a read
> > syscall?
> 
> I guess I'm not clear on what you're proposing. From my understanding
> of kprobes (admittedly not an expert), this is hard to do and not a
> very good match.
> 
> > >>Maybe. How about LRU? Reclaim performance is bad, and you want to work out
> > >>which pages keep going off the end of it, or which pages keep getting
> > >>written out via it, or whose pages are on the active list, forcing mine
> > >>out.
> > >
> > >
> > >Those are actually probably a good match for systemtap as they're all 
> > >events.
> > 
> > Traverse the LRU? Which files do they belong to? What process maps them?
> 
> -ENOPARSE.
> 

For non-event-based data gathering using kprobes, we can have a debugfs file
like /debug/kprobes/snapshot_probe and write a kprobe module with a probe at
its ->write() function; user space can then trigger the data collection:

echo "1" > /debug/kprobes/snapshot_probe

Thus, the actual data collection code can reside in a separate
module or a systemtap script which provides very good post-processing
capabilities, and can be used without recompiling or rebooting the kernel.
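
The user-space half of this is small enough to sketch in C.  The function
below does what the echo command above does: it writes "1" to a trigger
file.  The function name trigger_snapshot is made up for the example, and
the path is passed in because /debug/kprobes/snapshot_probe is only a
suggested name here, not an existing file.

```c
#include <stdio.h>

/* Write "1\n" to the trigger file at `path`.
 * Returns 0 on success, -1 if the file cannot be opened or written. */
int trigger_snapshot(const char *path)
{
    FILE *f = fopen(path, "w");
    if (!f)
        return -1;

    int ok = fputs("1\n", f) >= 0;
    /* fclose() flushes; a failed flush means the write did not land. */
    if (fclose(f) != 0)
        ok = 0;
    return ok ? 0 : -1;
}
```

A monitoring tool would call this and then read the collected data from
whatever channel the probe module uses to return it.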

Thanks
Maneesh

-- 
Maneesh Soni
Linux Technology Center,
IBM India Systems and Technology Lab, 
Bangalore, India


Re: [PATCH 0/13] maps: pagemap, kpagemap, and related cleanups

2007-04-13 Thread Frank Ch. Eigler

Christoph Hellwig <[EMAIL PROTECTED]> writes:

> [...]
> > merge it in the first place?
> 
> It's very nice to poke deep into the kernel for development purposes.
> For example for the spu scheduler work I'm doing currently I have
> a module using kprobes (not the systemtap crap because it's big, bloated,
> in an odd language, and does a lot of really wrong things in its runtime).

It may be worthwhile to remind people that it is easy to use systemtap
only to the extent of automating the placement of kprobes: just to
perform the function-name/source-file/line-number triplet to PC
mapping.  They can use embedded-C code to do all the same stuff they'd
do with kprobes.  They are not obligated to write any odd script code
for probing logic, nor indeed use any of this really wrong runtime.

> This module allows me to put probes into various places in the scheduler
> and writes them into a ringbuffer with timestamps allowing me to
> trace what's going on there.  This is really neat.  [...]

Indeed, and we too try to make this simple & fast: a couple of lines.

> [...] To summarize, I really love kprobes to ease my debugging work,
> but using it for any kind of production code is a total nightmare.

But at some point, some interface needs to be fixed for a final
user-space tool.  Whether that interface fixing is performed by kernel
developers being more reluctant to rewrite basic things, or by
providing a proc interface, or maintaining a kprobes module does not
matter.  Someone will feel constrained, and someone will be liberated.

One neat thing about our systemtap tool is that, no matter what layer
such interfaces become fixed within, it can probably interface to
them.  If there is no fixed interface, it can go down to debugging
info.  If there are tracing hooks present, it can attach.  It can make
the disparate standardization policies of different subsystems appear
unified.


> > We could distribute some systemtap scripts, and even distribute some
> > basic useful ones like this sort of page info in the kernel source
> > tree.
>
> We could not really distribute systemtap scripts with the kernel.
> systemtap is a bloody complicated piece of [software] 

I don't know if that should be treated as a compliment to our team, for
being able to work quickly on something that a full-grown kernel
developer finds bloody complicated.  Perhaps your information is
simply outdated.  Big & bloated?  We have several times asked for
specifics rather than smears - what about it?

> outside the kernel tree that breaks all the time we change kernel
> internals. [...]

That's begging the question.  If kernel folks are willing to maintain
some included systemtap-related code, then by definition it would not
break all the time.


- FChE


Re: [PATCH 0/13] maps: pagemap, kpagemap, and related cleanups

2007-04-13 Thread Andrew Morton
On Fri, 13 Apr 2007 12:24:51 -0500 Matt Mackall <[EMAIL PROTECTED]> wrote:

> > > From /proc/kpagemap + /proc/*/pagemap, you can
> > > basically synthesize any statistic you want, including all the
> > > existing ones. For some data, /proc/pid/smaps (or /proc/meminfo) will
> > > be considerably more efficient.
> > 
> > You'd need to poke clear_refs beforehand to make the referenced bits useful.
> > 
> > Actually, we also need to run around the ptes and collect the pte-referenced
> > bits too.  I don't think your code copes with any of that?
> 
> No, and it probably should. Perhaps dirty as well, though I've kind of
> lost the plot on how that works lately.

Dirty is OK: the VM keeps pte-dirtiness and page-dirtiness in sync now.


Re: [PATCH 0/13] maps: pagemap, kpagemap, and related cleanups

2007-04-13 Thread Matt Mackall
On Fri, Apr 13, 2007 at 10:03:56AM -0700, Andrew Morton wrote:
> On Fri, 13 Apr 2007 11:24:36 -0500 Matt Mackall <[EMAIL PROTECTED]> wrote:
> 
> > > It *will* be viable.  If the application wants to know if a page is dirty,
> > > it looks up "PG_dirty" in /proc/pg_foo-to-bitnumber and uses PG_dirty's
> > > numerical offset when inspecting fields in /proc/kpagemap.  If correctly
> > > designed, such a monitoring application will be able to report upon page
> > > flags which we haven't even thought up yet.
> > 
> > We can probably fit this in the existing (variable-sized) header.
> 
> hm, OK..
> 
> > > > I wonder what they are needed for.
> > > 
> > > Poking deeply into the kernel to provide information about kernel state. 
> > > 
> > > There are real-world needs for this, and the people who develop tools to
> > > process this information will have decent kernel understanding and will
> > > know that the file's contents may alter across kernel versions.  It sure
> > > beats poking around in /dev/kmem.
> > > 
> > > I doubt if there's a sensible way in which we can prettify this interface
> > > without losing information.  But we should aim to make it as robust as
> > > possible against future kernel changes, of course.
> > > 
> > > And we should satisfy ourselves that all the required information has been
> > > made available.  The fact that it will satisfy the Oracle requirement is
> > > encouraging.
> > > 
> > > Matt, these changes make the new field in /proc/pid/smaps redundant, don't
> > > they?
> > 
> > Which new field?
> 
> Referenced:
> 
> > From /proc/kpagemap + /proc/*/pagemap, you can
> > basically synthesize any statistic you want, including all the
> > existing ones. For some data, /proc/pid/smaps (or /proc/meminfo) will
> > be considerably more efficient.
> 
> You'd need to poke clear_refs beforehand to make the referenced bits useful.
> 
> Actually, we also need to run around the ptes and collect the pte-referenced
> bits too.  I don't think your code copes with any of that?

No, and it probably should. Perhaps dirty as well, though I've kind of
lost the plot on how that works lately.

> > But in general, most of the statistics in smaps are basically useless
> > for shared mappings, just like RSS. Problem is, we really don't know
> > what statistics we want yet, or even if it can be distilled down to
> > simple numbers anyway.
> 
> yup.  But that's the whole point, really: don't prejudge what info userspace
> is trying to collect.

Right.

-- 
Mathematics is the supreme nostalgia of our time.


Re: [PATCH 0/13] maps: pagemap, kpagemap, and related cleanups

2007-04-13 Thread Matt Mackall
On Fri, Apr 13, 2007 at 12:18:56PM +1000, Nick Piggin wrote:
> Can't you just traverse arbitrary kernel data structures at a given point
> in time, exactly like the /proc/ call is doing?

Perhaps.

My understanding is that you hook a kprobe to an event. An event is a
particular instruction getting executed. Indeed, you can do whatever
poking around in the kernel you want at that point. And then you can
stuff that data in a buffer that eventually gets to userspace.

This is very different from a read/seek/syscall. Rather than just
asking the kernel for some data, we have to wait for the relevant
events. Now, of course, you can make an ugly hack like hooking
sys_getpid() and basically make your own system call. Hopefully no one
else will call getpid() while you're doing this, etc. Not really how
it's intended to work at all, and probably a bitch to use, but
possible. Then the question becomes: why don't we do this for
everything else in /proc?

And the answer of course is: we put stuff in /proc because it's
generally useful. Extra points if it's actually related to
'proc'esses. Being able to tell what's paged in in a given mapping is
useful. Being able to tell what's shared between two mappings is
useful. Being able to get an accurate, meaningful picture of how your
memory is being used is useful. Heck, I bet some people might find it
useful to be able to see what nodes the pages in their process are on.
All stuff you shouldn't need to be a kernel hacker to answer.

The flags part of /proc/kpagemap exposes some (very interesting!)
implementation details. The rest of it is completely generic to any
system with a VM. It's only deep kernel magic in the sense that it's
not yet exposed.

-- 
Mathematics is the supreme nostalgia of our time.


Re: [PATCH 0/13] maps: pagemap, kpagemap, and related cleanups

2007-04-13 Thread Andrew Morton
On Fri, 13 Apr 2007 11:24:36 -0500 Matt Mackall <[EMAIL PROTECTED]> wrote:

> > It *will* be viable.  If the application wants to know if a page is dirty,
> > it looks up "PG_dirty" in /proc/pg_foo-to-bitnumber and uses PG_dirty's
> > numerical offset when inspecting fields in /proc/kpagemap.  If correctly
> > designed, such a monitoring application will be able to report upon page
> > flags which we haven't even thought up yet.
> 
> We can probably fit this in the existing (variable-sized) header.

hm, OK..

> > > I wonder what they are needed for.
> > 
> > Poking deeply into the kernel to provide information about kernel state. 
> > 
> > There are real-world needs for this, and the people who develop tools to
> > process this information will have decent kernel understanding and will
> > know that the file's contents may alter across kernel versions.  It sure
> > beats poking around in /dev/kmem.
> > 
> > I doubt if there's a sensible way in which we can prettify this interface
> > without losing information.  But we should aim to make it as robust as
> > possible against future kernel changes, of course.
> > 
> > And we should satisfy ourselves that all the required information has been
> > made available.  The fact that it will satisfy the Oracle requirement is
> > encouraging.
> > 
> > Matt, these changes make the new field in /proc/pid/smaps redundant, don't
> > they?
> 
> Which new field?

Referenced:

> From /proc/kpagemap + /proc/*/pagemap, you can
> basically synthesize any statistic you want, including all the
> existing ones. For some data, /proc/pid/smaps (or /proc/meminfo) will
> be considerably more efficient.

You'd need to poke clear_refs beforehand to make the referenced bits useful.

Actually, we also need to run around the ptes and collect the pte-referenced
bits too.  I don't think your code copes with any of that?
 
> But in general, most of the statistics in smaps are basically useless
> for shared mappings, just like RSS. Problem is, we really don't know
> what statistics we want yet, or even if it can be distilled down to
> simple numbers anyway.

yup.  But that's the whole point, really: don't prejudge what info userspace
is trying to collect.


Re: [PATCH 0/13] maps: pagemap, kpagemap, and related cleanups

2007-04-13 Thread Matt Mackall
On Thu, Apr 12, 2007 at 05:42:01PM -0700, Andrew Morton wrote:
> On Fri, 13 Apr 2007 10:15:24 +1000 Nick Piggin <[EMAIL PROTECTED]> wrote:
> 
> > >>+   ((char *)page)[1] = PAGE_SHIFT;
> > > 
> > > 
> > > OK.
> > 
> > Shouldn't we just expose page size and endianness by other means?
> > (another file or syscall).
> 
> I don't think so - this file exposes fairly deep kernel internals and
> that's unavoidable, really - it's *supposed* to do that.  It is explicitly
> designed for monitoring kernel behaviour.
> 
> So it needs special handling by userspace.  Keeping the number of files
> which need such special handling to a minimum will keep the number of
> applications which are exposed to kernel changes to a minimum.
> 
> > >>+   for (; i < 2 * chunk / KPMSIZE; i += 2, pfn++) {
> > >>+           ppage = pfn_to_page(pfn);
> > >>+           if (!ppage) {
> > >>+                   page[i] = 0;
> > >>+                   page[i + 1] = 0;
> > >>+           } else {
> > >>+                   page[i] = ppage->flags;
> > >>+                   page[i + 1] = atomic_read(&ppage->_count);
> > >>+           }
> > >>+   }
> > > 
> > > 
> > > Not a good idea to expose raw flags in this manner - it changes at the
> > > drop of a hat.  We'd need to also expose the kernel's PG_foo-to-bitnumber
> > > mapping to make this viable.
> > 
> > I don't think it is viable because that makes the flags part of the
> > userspace ABI.
> 
> It *will* be viable.  If the application wants to know if a page is dirty,
> it looks up "PG_dirty" in /proc/pg_foo-to-bitnumber and uses PG_dirty's
> numerical offset when inspecting fields in /proc/kpagemap.  If correctly
> designed, such a monitoring application will be able to report upon page
> flags which we haven't even thought up yet.

We can probably fit this in the existing (variable-sized) header.
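
For illustration, the name-to-bit lookup described above might look roughly
like this on the userspace side. This is only a sketch: the table format (one
"NAME BIT" pair per line) and the helper names pg_flag_bit() and
page_has_flag() are assumptions made for the example; nothing in the patch
series defines them.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical table format: one "NAME BITNUMBER" pair per line, e.g.
 * "PG_dirty 4".  The thread does not specify any such format.
 */

/* Return the bit number for `name`, or -1 if the table has no such flag. */
static int pg_flag_bit(const char *table, const char *name)
{
    const char *line = table;

    assert(table != NULL && name != NULL);

    while (*line != '\0') {
        char flag[64];
        int bit;

        if (sscanf(line, "%63s %d", flag, &bit) == 2 &&
            strcmp(flag, name) == 0)
            return bit;

        line = strchr(line, '\n');
        if (line == NULL)
            break;
        line++;                 /* step past the newline */
    }
    return -1;
}

/* Test one flag in a raw flags word, looking the bit up by name. */
static int page_has_flag(unsigned long flags, const char *table,
                         const char *name)
{
    int bit = pg_flag_bit(table, name);

    if (bit < 0 || bit >= (int)(sizeof(flags) * 8))
        return 0;               /* unknown flag: report "not set" */
    return (int)((flags >> bit) & 1UL);
}
```

Because the tool resolves flag names at run time, a renumbered or newly added
flag only changes the table contents, not the tool itself.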
 
> > I wonder what they are needed for.
> 
> Poking deeply into the kernel to provide information about kernel state. 
> 
> There are real-world needs for this, and the people who develop tools to
> process this information will have decent kernel understanding and will
> know that the file's contents may alter across kernel versions.  It sure
> beats poking around in /dev/kmem.
> 
> I doubt if there's a sensible way in which we can prettify this interface
> without losing information.  But we should aim to make it as robust as
> possible against future kernel changes, of course.
> 
> And we should satisfy ourselves that all the required information has been
> made available.  The fact that it will satisfy the Oracle requirement is
> encouraging.
> 
> Matt, these changes make the new field in /proc/pid/smaps redundant, don't
> they?

Which new field? From /proc/kpagemap + /proc/*/pagemap, you can
basically synthesize any statistic you want, including all the
existing ones. For some data, /proc/pid/smaps (or /proc/meminfo) will
be considerably more efficient.

But in general, most of the statistics in smaps are basically useless
for shared mappings, just like RSS. Problem is, we really don't know
what statistics we want yet, or even if it can be distilled down to
simple numbers anyway.

-- 
Mathematics is the supreme nostalgia of our time.


Re: [PATCH 0/13] maps: pagemap, kpagemap, and related cleanups

2007-04-13 Thread Theodore Tso
On Fri, Apr 13, 2007 at 01:40:08PM +1000, Nick Piggin wrote:
> With systemtap scripts, you could walk pagetables and print *the exact
> page information you want*, or you could walk pfns, or LRU, or page_tree,
> or walk the page tree then the rmap structures. And you can selectively
> cull out items you don't care about if you only care about a subset of
> items, based on arbitrary criteria. And you can most likely do all that
> more efficiently than with a conglomeration of various /proc files
> (assuming they even provide what you want in the first place).

Yes, but maintaining the systemtap scripts will be a nightmare, since
they would be outside the kernel, and as we change our internal data
structure, the scripts would become useless.

This is a fundamental problem with systemtap that we haven't been able
to solve yet, because solving it would freeze various internal data
structures or kernel functions.  I agree that's not acceptable; which
is why I don't think systemtap would be a good match for the problem
we're trying to solve here.

- Ted


Re: [PATCH 0/13] maps: pagemap, kpagemap, and related cleanups

2007-04-13 Thread Nick Piggin

Ananth N Mavinakayanahalli wrote:
> On Fri, Apr 13, 2007 at 12:50:20PM +1000, Nick Piggin wrote:
>
>> It definitely seems like you can use some kernel functions, but the
>> ones I saw may just be systemtap facilities. But what is so surprising
>> about being able to call a kernel function when running in kernel
>> context? Perhaps there is some fundamental limitation of kprobes that
>> I don't understand.
>
> The main requirement for kprobes handlers is that they can't sleep. You
> could definitely call a kernel function from kprobe handlers as long as
> the function doesn't sleep.

That would be enough to access basically all the VM data structures.

--
SUSE Labs, Novell Inc.


Re: [PATCH 0/13] maps: pagemap, kpagemap, and related cleanups

2007-04-13 Thread Ananth N Mavinakayanahalli
On Fri, Apr 13, 2007 at 12:54:36PM +1000, Nick Piggin wrote:
> Matt Mackall wrote:
> >On Fri, Apr 13, 2007 at 12:21:25PM +1000, Nick Piggin wrote:
> >
> >>Matt Mackall wrote:
> >>
> >>>On Fri, Apr 13, 2007 at 11:42:29AM +1000, Nick Piggin wrote:
> >>
> If kprobes is simply crappy and doesn't work properly for this, then I
> could accept that. I'm not someone trying to get this info. So why can't
> it be used? (not just for kpagemap, but for clear_refs and all that gunk
> too).
> >>>
> >>>
> >>>kprobes is good for looking at events, but bad for looking at state.
> >>>Especially metric shitloads of state.
> >>
> >>Why? Why is a kprobes trap significantly more expensive than a read
> >>syscall?
> >
> >
> >I guess I'm not clear on what you're proposing. From my understanding
> >of kprobes (admittedly not an expert), this is hard to do and not a
> >very good match.
> 
> But you have an idea that it is bad for exposing lots of data. Why?
> (I'm not a kprobes expert either, these are not rhetorical questions)

You could tie your kprobe module to use relay channels. Kprobe handlers
run lockless and using the per-cpu relay channels will provide a fast
transport mechanism for exposing lots of data.

http://relayfs.sourceforge.net/examples.html#tprintk_kprobes is an
example using the earlier relayfs interface. It shouldn't be that hard
to change it to use the newer relay stuff.

AFAIK acme is using a similar mechanism for ctracer
(http://oops.ghostprotocols.net:81/blog/?p=50)

Ananth


Re: [PATCH 0/13] maps: pagemap, kpagemap, and related cleanups

2007-04-13 Thread Ananth N Mavinakayanahalli
On Fri, Apr 13, 2007 at 12:50:20PM +1000, Nick Piggin wrote:
> Andrew Morton wrote:
> >On Fri, 13 Apr 2007 12:18:56 +1000 Nick Piggin <[EMAIL PROTECTED]> wrote:
> >
> >
> >>>I guess one could generate an answer to the static question with
> >>>systemtap, by accumulating running counts across the application
> >>>lifetime and then snapshotting them.  Sounds hard though.
> >>
> >>Can't you just traverse arbitrary kernel data structures at a given point
> >>in time, exactly like the /proc/ call is doing?
> >
> >
> >Do a full pagetable walk, with all the associated locking from within
> >a systemtap script?  I'd be surprised.  Maybe if it's mostly hand-coded
> >in C, perhaps.
> 
> It looks like you can traverse arbitrary data structures, yes.
> 
> It definitely seems like you can use some kernel functions, but the
> ones I saw may just be systemtap facilities. But what is so surprising
> about being able to call a kernel function when running in kernel
> context? Perhaps there is some fundamental limitation of kprobes that
> I don't understand.

The main requirement for kprobes handlers is that they can't sleep. You
could definitely call a kernel function from kprobe handlers as long as
the function doesn't sleep.

Ananth


Re: [PATCH 0/13] maps: pagemap, kpagemap, and related cleanups

2007-04-13 Thread Christoph Hellwig
On Fri, Apr 13, 2007 at 06:25:46PM +1000, Nick Piggin wrote:
> But at least make it into its own module with a debugfs interface or
> something. I mean, exposing a PG_name-to-nr and page count pfn and flags
> as a supposedly formal proc interface doesn't sound nice to me. Page
> flags does not tell you what is going on in the VM, it gives you a tiny
> window into "something". Between reading a /proc/pid/ pfn and finding
> the pfn's page flags, it could be used for something completely different
> anyway.

I agree that exposing numerical values of page flags is not a very good
idea at all.  If we really want to expose this information it should
at least be in string form, although that is quite a bit of a maintenance
horror as well.



Re: [PATCH 0/13] maps: pagemap, kpagemap, and related cleanups

2007-04-13 Thread Nick Piggin

Christoph Hellwig wrote:
> On Fri, Apr 13, 2007 at 06:03:45PM +1000, Nick Piggin wrote:
>
>> Yeah good point ;) I just meant the wider "we".
>>
>> With all the problems kprobes has, something like poking deep into
>> kernel internals seems like a good thing to use it for instead of
>> hardcoding that stuff into the kernel. If not, then why did we even
>> merge it in the first place?
>
> It's very nice to poke deep into the kernel for development purposes.
> For example for the spu scheduler work I'm doing currently I have
> a module using kprobes (not the systemtap crap because it's big, bloated,
> in an odd language, and does a lot of really wrong things in its runtime).

OK, I pick systemtap because I don't know any better... but kprobes
is what I mean for the kernel interface.

> This module allows me to put probes into various places in the scheduler
> and writes them into a ringbuffer with timestamps allowing me to
> trace what's going on there.  This is really neat.  Unfortunately it
> breaks as soon as I do some major reshuffling because then the points
> it hooks up to are not there anymore.  That's perfectly fine for my
> setup, because _I_ know what I have to change when it breaks, and can
> easily fix that.  Now imagine a similar module to trace pagecache activity
> used by a third-party monitoring tool.  We now get a major change to
> the pagecache (say to make it lockless), and the probes just break.  In
> the best case it just doesn't work anymore, in the worst case it crashes
> the kernel.  Now if the app vendor at least gave me their source I
> could at least fix it to not crash, but there's a fair enough chance
> they poke into bits that simply aren't there anymore.
>
> Now if we have a proper user interface with real code behind it we can
> have a defined interface.  If the interface is bad enough (or just too
> low-level) we might have the last problem of statistics that were there
> once going away as well, but we can deal with that gracefully by declaring
> parts of the stats volatile and making sure people don't mess with them.
>
> To summarize, I really love kprobes to ease my debugging work, but using
> it for any kind of production code is a total nightmare.

OK, well Matt's stuff that he needs doesn't have to be kprobes at all,
and yes if lots of people want the same thing then we could distribute
it with the kernel.

But at least make it into its own module with a debugfs interface or
something. I mean, exposing a PG_name-to-nr and page count pfn and flags
as a supposedly formal proc interface doesn't sound nice to me. Page
flags does not tell you what is going on in the VM, it gives you a tiny
window into "something". Between reading a /proc/pid/ pfn and finding
the pfn's page flags, it could be used for something completely different
anyway.

>> We could distribute some systemtap scripts, and even distribute some
>> basic useful ones like this sort of page info in the kernel source
>> tree.
>
> We could not really distribute systemtap scripts with the kernel.
> systemtap is a bloody complicated piece of shit outside the kernel
> tree that breaks all the time we change kernel internals.  We could
> provide useful kprobes modules, a proper tracing system (ltt-ng-lite)
> and surrounding infrastructure.

OK ;)

--
SUSE Labs, Novell Inc.


Re: [PATCH 0/13] maps: pagemap, kpagemap, and related cleanups

2007-04-13 Thread William Lee Irwin III
On Fri, Apr 13, 2007 at 08:51:42AM +0100, Christoph Hellwig wrote:
> Umm, folks.  systemtap basically means people compile kernel modules
> from an odd scripting language with embedded C snippets into kernel
> modules.  The kernel modules don't use normal exported APIs but use
> kallsyms and dwarf info to poke into every possible private bit.
> Saying you don't care the slightest whether oracle will load huge
> amounts of code into the kernel that depends on intimate implementation
> details, and that you don't even have source to debug it is not what
> I'd call "none of us need to care in the slightest", at least for those
> of you working for distributions that may actually have to debug this
> shit in the end.
> While we're at it, what happened to the idea of tainting the kernel
> as soon as kprobes are placed in the kernel to at least make people
> aware of it?

This is for a system monitoring app outside the database proper that
just happens to be done by the same .com as makes the database. It's
got little to do with the database itself apart from knowing how to
tell the database to e.g. let fewer clients in. The part that deals
with this is sort of like a custom procps that does things rather
specifically how they need them, including being portable to other
OS's IIRC, though the scope of the app is much larger than that.

They're actually quite concerned about issues of this sort since they
want to run all the time in the background in order to respond to
system conditions, though probably not necessarily rapid-fire sorts of
responses to second-by-second changes in conditions.

Basically, they're not a debugging affair, and they need to be able to
run in supported conditions. They're rather disinterested in things
that would, say, taint the kernel or take customers out of supported
configurations. They'll fall back to the known-grossly-inaccurate
RSS-based estimates they're using now in preference to such.

They don't want omnibus back doors into the kernel and I honestly
expect them to NAK the systemtap solution. They really want the
"uniquely attributable RSS" or "proportional RSS" reported directly,
and it takes some doing to convince them that this can't be done
directly for various reasons, e.g. floating point in the kernel won't
fly. They can program sufficiently well to maintain a database of pfn's,
pid's of processes mapping them, and user virtual addresses they're
mapped at (easy enough to kick off a database instance for it if they
don't feel comfortable just hashing the triples) and do the tabulation
from there, though they're not happy having to do so much of the
calculation themselves. Actually, I promised them reporting of mapcount
which would make per-process UARSS/PRSS calculation able to be done on
a process-by-process basis, though I can probably convince them to do
whole-system pfn-by-pfn tabulation if such is lacking.
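
As a rough illustration of the per-process tabulation described above, the
sketch below computes "uniquely attributable RSS" and "proportional RSS" from
an array of mapcounts using integer arithmetic only. It assumes one entry per
page the process has resident, holding the number of mappings of that page
across all processes; the names uarss_pages(), prss_scaled() and PRSS_SCALE
are made up for the example and do not come from any patch in this thread.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Fixed-point scale: proportional results are reported in units of
 * 1/1024 of a page, so no floating point is needed anywhere.
 */
#define PRSS_SCALE 1024UL

/* Uniquely attributable RSS: pages mapped exactly once, i.e. only here. */
static unsigned long uarss_pages(const unsigned int *mapcount, size_t n)
{
    unsigned long pages = 0;
    size_t i;

    assert(mapcount != NULL || n == 0);

    for (i = 0; i < n; i++)
        if (mapcount[i] == 1)
            pages++;
    return pages;
}

/*
 * Proportional RSS: each page contributes 1/mapcount of a page.
 * Entries with mapcount 0 are skipped.  Integer division truncates,
 * so a page shared three ways contributes 341/1024 rather than 1/3.
 */
static unsigned long prss_scaled(const unsigned int *mapcount, size_t n)
{
    unsigned long total = 0;
    size_t i;

    assert(mapcount != NULL || n == 0);

    for (i = 0; i < n; i++)
        if (mapcount[i] > 0)
            total += PRSS_SCALE / mapcount[i];
    return total;
}
```

Dividing the scaled result by PRSS_SCALE gives whole pages; keeping the
scaled value preserves the fractional part for reporting.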


-- wli


Re: [PATCH 0/13] maps: pagemap, kpagemap, and related cleanups

2007-04-13 Thread Christoph Hellwig
On Fri, Apr 13, 2007 at 06:03:45PM +1000, Nick Piggin wrote:
> Yeah good point ;) I just meant the wider "we".
> 
> With all the problems kprobes has, something like poking deep into
> kernel internals seems like a good thing to use it for instead of
> hardcoding that stuff into the kernel. If not, then why did we even
> merge it in the first place?

It's very nice to poke deep into the kernel for development purposes.
For example for the spu scheduler work I'm doing currently I have
a module using kprobes (not the systemtap crap because it's big, bloated,
in an odd language, and does a lot of really wrong things in its runtime).

This module allows me to put probes into various places in the scheduler
and writes them into a ringbuffer with timestamps allowing me to
trace what's going on there.  This is really neat.  Unfortunately it
breaks as soon as I do some major reshuffling because then the points
it hooks up to are not there anymore.  That's perfectly fine for my
setup, because _I_ know what I have to change when it breaks, and can
easily fix that.  Now imagine a similar module to trace pagecache activity
used by a third-party monitoring tool.  We now get a major change to
the pagecache (say to make it lockless), and the probes just break.  In
the best case it just doesn't work anymore, in the worst case it crashes
the kernel.  Now if the app vendor at least gave me their source I
could at least fix it to not crash, but there's a fair enough chance
they poke into bits that simply aren't there anymore.
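
The timestamped ring buffer mentioned above is not shown in this thread. A
minimal userspace sketch of the general idea, assuming a fixed number of slots
where new events overwrite the oldest ones, could look like the following. The
names trace_ring, trace_ring_put() and friends are hypothetical, and the
timestamp is passed in by the caller here, whereas a real module would read it
from a kernel clock.

```c
#include <assert.h>
#include <stddef.h>

#define TRACE_SLOTS 4           /* deliberately tiny for illustration */

struct trace_event {
    unsigned long long timestamp;   /* supplied by the caller */
    int probe_id;                   /* which probe point fired */
};

struct trace_ring {
    struct trace_event slot[TRACE_SLOTS];
    size_t head;                    /* total number of events ever written */
};

static void trace_ring_init(struct trace_ring *r)
{
    assert(r != NULL);
    r->head = 0;
}

/* Record one event, overwriting the oldest entry once the ring is full. */
static void trace_ring_put(struct trace_ring *r,
                           unsigned long long timestamp, int probe_id)
{
    struct trace_event *e;

    assert(r != NULL);
    e = &r->slot[r->head % TRACE_SLOTS];
    e->timestamp = timestamp;
    e->probe_id = probe_id;
    r->head++;
}

/* Number of events currently held in the ring. */
static size_t trace_ring_count(const struct trace_ring *r)
{
    return r->head < TRACE_SLOTS ? r->head : TRACE_SLOTS;
}

/* Return the i-th oldest surviving event, or NULL if i is out of range. */
static const struct trace_event *trace_ring_get(const struct trace_ring *r,
                                                size_t i)
{
    size_t count = trace_ring_count(r);

    if (i >= count)
        return NULL;
    return &r->slot[(r->head - count + i) % TRACE_SLOTS];
}
```

The sketch ignores concurrency entirely; code running from probe handlers
would also have to cope with multiple CPUs and must not sleep.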

Now if we have a proper user interface with real code behind it we can
have a defined interface.  If the interface is bad enough (or just too
low-level) we might have the last problem of statistics that were there
once going away as well, but we can deal with that gracefully by declaring
parts of the stats volatile and making sure people don't mess with them.

To summarize, I really love kprobes to ease my debugging work, but using
it for any kind of production code is a total nightmare.

> We could distribute some systemtap scripts, and even distribute some
> basic useful ones like this sort of page info in the kernel source
> tree.

We could not really distribute systemtap scripts with the kernel.
systemtap is a bloody complicated piece of shit outside the kernel
tree that breaks all the time we change kernel internals.  We could
provide useful kprobes modules, a proper tracing system (ltt-ng-lite)
and surrounding infrastructure.



Re: [PATCH 0/13] maps: pagemap, kpagemap, and related cleanups

2007-04-13 Thread Nick Piggin

Christoph Hellwig wrote:
> On Fri, Apr 13, 2007 at 05:05:47PM +1000, Nick Piggin wrote:
>
>> Ah, OK. Anyway, with kprobes/systemtap they can do whatever they like
>> and none of us need to care in the slightest ;)
>
> Umm, folks.  systemtap basically means people compile kernel modules
> from an odd scripting language with embedded C snippets into kernel
> modules.  The kernel modules don't use normal exported APIs but use
> kallsyms and dwarf info to poke into every possible private bit.
>
> Saying you don't care the slightest whether oracle will load huge
> amounts of code into the kernel that depends on intimate implementation
> details, and that you don't even have source to debug it is not what
> I'd call "none of us need to care in the slightest", at least for those
> of you working for distributions that may actually have to debug this
> shit in the end.

Yeah good point ;) I just meant the wider "we".

With all the problems kprobes has, something like poking deep into
kernel internals seems like a good thing to use it for instead of
hardcoding that stuff into the kernel. If not, then why did we even
merge it in the first place?

We could distribute some systemtap scripts, and even distribute some
basic useful ones like this sort of page info in the kernel source
tree.

> While we're at it, what happened to the idea of tainting the kernel
> as soon as kprobes are placed in the kernel to at least make people
> aware of it?

Seems like a very good idea.


--
SUSE Labs, Novell Inc.


Re: [PATCH 0/13] maps: pagemap, kpagemap, and related cleanups

2007-04-13 Thread Christoph Hellwig
On Fri, Apr 13, 2007 at 05:05:47PM +1000, Nick Piggin wrote:
> Ah, OK. Anyway, with kprobes/systemtap they can do whatever they like
> and none of us need to care in the slightest ;)

Umm, folks.  systemtap basically means people compile kernel modules
from an odd scripting language with embedded C snippets into kernel
modules.  The kernel modules don't use normal exported APIs but use
kallsyms and dwarf info to poke into every possible private bit.

Saying you don't care the slightest whether oracle will load huge
amounts of code into the kernel that depends on intimate implementation
details, and that you don't even have source to debug it is not what
I'd call "none of us need to care in the slightest", at least for those
of you working for distributions that may actually have to debug this
shit in the end.

While we're at it, what happened to the idea of tainting the kernel
as soon as kprobes are placed in the kernel to at least make people
aware of it?



Re: [PATCH 0/13] maps: pagemap, kpagemap, and related cleanups

2007-04-13 Thread William Lee Irwin III
William Lee Irwin III wrote:
>> The EM guys are unwilling or unable for support-oriented reasons to
>> deal with anything but unmodified kernels as shipped by distros.

On Fri, Apr 13, 2007 at 05:03:43PM +1000, Nick Piggin wrote:
> And I think major distros ship with kprobes enabled, so that is yet
> another reason why systemtap should be considered before adding these
> proc interfaces.

I'll have to check in and see if that will work for them. A lot of this
is about customer/distro/support interaction constraints on how it works
as opposed to purely technical affairs.


-- wli


Re: [PATCH 0/13] maps: pagemap, kpagemap, and related cleanups

2007-04-13 Thread Nick Piggin

William Lee Irwin III wrote:
> Andrew Morton wrote:
>
>>> Then you just end up with the same thing, don't you?
>
> On Fri, Apr 13, 2007 at 12:50:20PM +1000, Nick Piggin wrote:
>
>> Well _you_ do, because that happens to be exactly what you want. Bill
>> ends up with something that displays page_mapcount instead. And I
>> end up with something that traverses LRU lists rather than pfns. And
>> none of it goes in /proc/ or linux-2.6/.
>> So it isn't really the same thing at all.
>
> The EM guys aren't dealing with the database; they're dealing with some
> enterprise management thingie that does things like control how many
> client connections are allowed for each database instance. Unless
> they're doing less than I expect, and are largely something like procps
> on steroids and enterprise silliness.

Ah, OK. Anyway, with kprobes/systemtap they can do whatever they like
and none of us need to care in the slightest ;)

--
SUSE Labs, Novell Inc.


Re: [PATCH 0/13] maps: pagemap, kpagemap, and related cleanups

2007-04-13 Thread Nick Piggin

William Lee Irwin III wrote:
> Andrew Morton wrote:
>
>>> Do a full pagetable walk, with all the associated locking from within
>>> a systemtap script?  I'd be surprised.  Maybe if it's mostly hand-coded
>>> in C, perhaps.  Then you just end up with the same thing, don't you?
>
> On Fri, Apr 13, 2007 at 01:40:08PM +1000, Nick Piggin wrote:
>
>> And my problem isn't with the hardcoded pagetable walker. Yeah, we'd
>> probably still keep the pagetable callback walker thingy with Matt's
>> associated cleanups (and my subsequent ones to clean it up more and
>> move it to mm/): there are other in-kernel users for that anyway.
>> The point is the proc API, and exposing random little parts of deep
>> kernel internals that some people happen to find useful at the time.
>> (which is why we have an incredible proliferation of these things).
>> With systemtap scripts, you could walk pagetables and print *the exact
>> page information you want*, or you could walk pfns, or LRU, or page_tree,
>> or walk the page tree then the rmap structures. And you can selectively
>> cull out items you don't care about if you only care about a subset of
>> items, based on arbitrary criteria. And you can most likely do all that
>> more efficiently than with a conglomeration of various /proc files
>> (assuming they even provide what you want in the first place).
>
> The EM guys are unwilling or unable for support-oriented reasons to
> deal with anything but unmodified kernels as shipped by distros.

And I think major distros ship with kprobes enabled, so that is yet
another reason why systemtap should be considered before adding these
proc interfaces.

Thanks,
Nick

--
SUSE Labs, Novell Inc.


Re: [PATCH 0/13] maps: pagemap, kpagemap, and related cleanups

2007-04-13 Thread William Lee Irwin III
Andrew Morton wrote:
>> Do a full pagetable walk, with all the associated locking from within
>> a systemtap script?  I'd be surprised.  Maybe if it's mostly hand-coded
>> in C, perhaps.  Then you just end up with the same thing, don't you?

On Fri, Apr 13, 2007 at 01:40:08PM +1000, Nick Piggin wrote:
> And my problem isn't with the hardcoded pagetable walker. Yeah, we'd
> probably still keep the pagetable callback walker thingy with Matt's
> associated cleanups (and my subsequent ones to clean it up more and
> move it to mm/): there are other in-kernel users for that anyway.
> The point is the proc API, and exposing random little parts of deep
> kernel internals that some people happen to find useful at the time.
> (which is why we have an incredible proliferation of these things).
> With systemtap scripts, you could walk pagetables and print *the exact
> page information you want*, or you could walk pfns, or LRU, or page_tree,
> or walk the page tree then the rmap structures. And you can selectively
> cull out items you don't care about if you only care about a subset of
> items, based on arbitrary criteria. And you can most likely do all that
> more efficiently than with a conglomeration of various /proc files
> (assuming they even provide what you want in the first place).

The EM guys are unwilling or unable for support-oriented reasons to
deal with anything but unmodified kernels as shipped by distros.


-- wli


Re: [PATCH 0/13] maps: pagemap, kpagemap, and related cleanups

2007-04-13 Thread William Lee Irwin III
Andrew Morton wrote:
>> Then you just end up with the same thing, don't you?

On Fri, Apr 13, 2007 at 12:50:20PM +1000, Nick Piggin wrote:
> Well _you_ do, because that happens to be exactly what you want. Bill
> ends up with something that displays page_mapcount instead. And I
> end up with something that traverses LRU lists rather than pfns. And
> none of it goes in /proc/ or linux-2.6/.
> So it isn't really the same thing at all.

The EM guys aren't dealing with the database; they're dealing with some
enterprise management thingie that does things like control how many
client connections are allowed for each database instance. Unless
they're doing less than I expect, and are largely something like procps
on steroids and enterprise silliness.


-- wli


Re: [PATCH 0/13] maps: pagemap, kpagemap, and related cleanups

2007-04-13 Thread Nick Piggin

William Lee Irwin III wrote:

Andrew Morton wrote:


Do a full pagetable walk, with all the associated locking from within
a systemtap script?  I'd be surprised.  Maybe if it's mostly hand-coded
in C, perhaps.  Then you just end up with the same thing, don't you?



On Fri, Apr 13, 2007 at 01:40:08PM +1000, Nick Piggin wrote:


And my problem isn't with the hardcoded pagetable walker. Yeah, we'd
probably still keep the pagetable callback walker thingy with Matt's
associated cleanups (and my subsequent ones to clean it up more and
move it to mm/): there are other in-kernel users for that anyway.
The point is the proc API, and exposing random little parts of deep
kernel internals that some people happen to find useful at the time.
(which is why we have an incredible proliferation of these things).
With systemtap scripts, you could walk pagetables and print *the exact
page information you want*, or you could walk pfns, or LRU, or page_tree,
or walk the page tree then the rmap structures. And you can selectively
cull out items you don't care about if you only care about a subset of
items, based on arbitrary criteria. And you can most likely do all that
more efficiently than with a conglomeration of various /proc files
(assuming they even provide what you want in the first place).



The EM guys are unwilling or unable for support-oriented reasons to
deal with anything but unmodified kernels as shipped by distros.


And I think major distros ship with kprobes enabled, so that is yet
another reason why systemtap should be considered before adding these
proc interfaces.

Thanks,
Nick

--
SUSE Labs, Novell Inc.


Re: [PATCH 0/13] maps: pagemap, kpagemap, and related cleanups

2007-04-13 Thread Nick Piggin

William Lee Irwin III wrote:

Andrew Morton wrote:


Then you just end up with the same thing, don't you?



On Fri, Apr 13, 2007 at 12:50:20PM +1000, Nick Piggin wrote:


Well _you_ do, because that happens to be exactly what you want. Bill
ends up with something that displays page_mapcount instead. And I
end up with something that traverses LRU lists rather than pfns. And
none of it goes in /proc/ or linux-2.6/.
So it isn't really the same thing at all.



The EM guys aren't dealing with the database; they're dealing with some
enterprise management thingie that does things like control how many
client connections are allowed for each database instance. Unless
they're doing less than I expect, and are largely something like procps
on steroids and enterprise silliness.


Ah, OK. Anyway, with kprobes/systemtap they can do whatever they like
and none of us need to care in the slightest ;)

--
SUSE Labs, Novell Inc.


Re: [PATCH 0/13] maps: pagemap, kpagemap, and related cleanups

2007-04-13 Thread William Lee Irwin III
William Lee Irwin III wrote:
 The EM guys are unwilling or unable for support-oriented reasons to
 deal with anything but unmodified kernels as shipped by distros.

On Fri, Apr 13, 2007 at 05:03:43PM +1000, Nick Piggin wrote:
 And I think major distros ship with kprobes enabled, so that is yet
 another reason why systemtap should be considered before adding these
 proc interfaces.

I'll have to check in and see if that will work for them. A lot of this
is about customer/distro/support interaction constraints on how it works
as opposed to purely technical affairs.


-- wli


Re: [PATCH 0/13] maps: pagemap, kpagemap, and related cleanups

2007-04-13 Thread Christoph Hellwig
On Fri, Apr 13, 2007 at 05:05:47PM +1000, Nick Piggin wrote:
 Ah, OK. Anyway, with kprobes/systemtap they can do whatever they like
 and none of us need to care in the slightest ;)

Umm, folks.  systemtap basically means people compile kernel modules
from an odd scripting language with embedded C snippets.  Those
modules don't use normal exported APIs but use kallsyms and DWARF
info to poke into every possible private bit.

Saying you don't care in the slightest whether Oracle will load huge
amounts of code into the kernel that depends on intimate implementation
details, and that you don't even have the source to debug, is not what
I'd call "none of us need to care in the slightest", at least for those
of you working for distributions that may actually have to debug this
shit in the end.

While we're at it, what happened to the idea of tainting the kernel
as soon as kprobes are placed in the kernel to at least make people
aware of it?



Re: [PATCH 0/13] maps: pagemap, kpagemap, and related cleanups

2007-04-13 Thread Nick Piggin

Christoph Hellwig wrote:

On Fri, Apr 13, 2007 at 05:05:47PM +1000, Nick Piggin wrote:


Ah, OK. Anyway, with kprobes/systemtap they can do whatever they like
and none of us need to care in the slightest ;)



Umm, folks.  systemtap basically means people compile kernel modules
from an odd scripting language with embedded C snippets.  Those
modules don't use normal exported APIs but use kallsyms and DWARF
info to poke into every possible private bit.

Saying you don't care in the slightest whether Oracle will load huge
amounts of code into the kernel that depends on intimate implementation
details, and that you don't even have the source to debug, is not what
I'd call "none of us need to care in the slightest", at least for those
of you working for distributions that may actually have to debug this
shit in the end.


Yeah good point ;) I just meant the wider we.

With all the problems kprobes has, something like poking deep into
kernel internals seems like a good thing to use it for instead of
hardcoding that stuff into the kernel. If not, then why did we even
merge it in the first place?

We could distribute some systemtap scripts, and even distribute some
basic useful ones like this sort of page info in the kernel source
tree.



While we're at it, what happened to the idea of tainting the kernel
as soon as kprobes are placed in the kernel to at least make people
aware of it?


Seems like a very good idea.


--
SUSE Labs, Novell Inc.


Re: [PATCH 0/13] maps: pagemap, kpagemap, and related cleanups

2007-04-13 Thread Christoph Hellwig
On Fri, Apr 13, 2007 at 06:03:45PM +1000, Nick Piggin wrote:
 Yeah good point ;) I just meant the wider we.
 
 With all the problems kprobes has, something like poking deep into
 kernel internals seems like a good thing to use it for instead of
 hardcoding that stuff into the kernel. If not, then why did we even
 merge it in the first place?

It's very nice to poke deep into the kernel for development purposes.
For example, for the spu scheduler work I'm doing currently I have
a module using kprobes (not the systemtap crap, because it's big, bloated,
in an odd language, and does a lot of really wrong things in its runtime).

This module allows me to put probes into various places in the scheduler
and writes them into a ringbuffer with timestamps, allowing me to
trace what's going on there.  This is really neat.  Unfortunately it
breaks as soon as I do some major reshuffling because then the points
it hooks up to are not there anymore.  That's perfectly fine for my
setup, because _I_ know what I have to change when it breaks, and can
easily fix that.  Now imagine a similar module to trace pagecache activity
used by a third-party monitoring tool.  We now get a major change to
the pagecache (say to make it lockless), and the probes just break.  In
the best case it just doesn't work anymore, in the worst case it crashes
the kernel.  Now if the app vendor at least gave me their source I
could at least fix it to not crash, but there's a fair enough chance
they poke into bits that simply aren't there anymore.
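
A minimal sketch of the kind of timestamped ring buffer described above,
written as plain userspace C so it can be compiled and tried on its own.
It is not the module from this mail: the names trace_ring, trace_record
and trace_get are hypothetical, the timestamp is supplied by the caller,
and a real kprobes module would also need per-CPU buffers or some other
way to cope with concurrent probe hits.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TRACE_SLOTS 8u   /* power of two so the index can wrap with a mask */

struct trace_event {
    uint64_t timestamp;  /* caller-supplied time, e.g. nanoseconds */
    uint32_t probe_id;   /* which probe point fired */
};

struct trace_ring {
    struct trace_event slot[TRACE_SLOTS];
    uint64_t head;       /* total number of events ever written */
};

/* Record one event, overwriting the oldest once the ring is full. */
static void trace_record(struct trace_ring *r, uint64_t ts, uint32_t id)
{
    assert(r != NULL);
    struct trace_event *e = &r->slot[r->head & (TRACE_SLOTS - 1)];
    e->timestamp = ts;
    e->probe_id = id;
    r->head++;
}

/* Number of events currently retained (at most TRACE_SLOTS). */
static size_t trace_count(const struct trace_ring *r)
{
    return r->head < TRACE_SLOTS ? (size_t)r->head : (size_t)TRACE_SLOTS;
}

/* Fetch the i-th retained event, oldest first; returns 0 if out of range. */
static int trace_get(const struct trace_ring *r, size_t i,
                     struct trace_event *out)
{
    size_t n = trace_count(r);
    if (i >= n)
        return 0;
    uint64_t first = r->head - n;   /* sequence number of the oldest event */
    *out = r->slot[(first + i) & (TRACE_SLOTS - 1)];
    return 1;
}
```

Overwriting the oldest entry keeps the recording path free of allocation
and blocking, at the price of losing history when the reader falls behind.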

Now if we have a proper user interface with real code behind it we can
have a defined interface.  If the interface is bad enough (or just too
lowlevel) we might still have that last problem, with statistics that
were there once going away as well, but we can deal with that gracefully
by declaring parts of the stats volatile and making sure people don't
mess with them.

To summarize, I really love kprobes to ease my debugging work, but using
it for any kind of production code is a total nightmare.

 We could distribute some systemtap scripts, and even distribute some
 basic useful ones like this sort of page info in the kernel source
 tree.

We could not really distribute systemtap scripts with the kernel.
systemtap is a bloody complicated piece of shit outside the kernel
tree that breaks every time we change kernel internals.  We could
provide useful kprobes modules, a proper tracing system (ltt-ng-lite)
and surrounding infrastructure.



Re: [PATCH 0/13] maps: pagemap, kpagemap, and related cleanups

2007-04-13 Thread William Lee Irwin III
On Fri, Apr 13, 2007 at 08:51:42AM +0100, Christoph Hellwig wrote:
 Umm, folks.  systemtap basically means people compile kernel modules
 from an odd scripting language with embedded C snippets.  Those
 modules don't use normal exported APIs but use kallsyms and DWARF
 info to poke into every possible private bit.
 Saying you don't care in the slightest whether Oracle will load huge
 amounts of code into the kernel that depends on intimate implementation
 details, and that you don't even have the source to debug, is not what
 I'd call "none of us need to care in the slightest", at least for those
 of you working for distributions that may actually have to debug this
 shit in the end.
 While we're at it, what happened to the idea of tainting the kernel
 as soon as kprobes are placed in the kernel to at least make people
 aware of it?

This is for a system monitoring app outside the database proper that
just happens to be done by the same .com as makes the database. It's
got little to do with the database itself apart from knowing how to
tell the database to e.g. let fewer clients in. The part that deals
with this is sort of like a custom procps that does things rather
specifically how they need them, including being portable to other
OS's IIRC, though the scope of the app is much larger than that.

They're actually quite concerned about issues of this sort since they
want to run all the time in the background in order to respond to
system conditions, though probably not necessarily rapid-fire sorts of
responses to second-by-second changes in conditions.

Basically, they're not a debugging affair, and they need to be able to
run in supported conditions. They're rather uninterested in things
that would, say, taint the kernel or take customers out of supported
configurations. They'll fall back to the known-grossly-inaccurate
RSS-based estimates they're using now in preference to such.

They don't want omnibus back doors into the kernel and I honestly
expect them to NAK the systemtap solution. They really want the
uniquely attributable RSS or proportional RSS reported directly,
and it takes some doing to convince them that this can't be done
directly for various reasons, e.g. floating point in the kernel won't
fly. They can program sufficiently well to maintain a database of pfn's,
pid's of processes mapping them, and user virtual addresses they're
mapped at (easy enough to kick off a database instance for it if they
don't feel comfortable just hashing the triples) and do the tabulation
from there, though they're not happy having to do so much of the
calculation themselves. Actually, I promised them reporting of mapcount
which would make per-process UARSS/PRSS calculation able to be done on
a process-by-process basis, though I can probably convince them to do
whole-system pfn-by-pfn tabulation if such is lacking.
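
A minimal sketch of the per-process tabulation described above, assuming
the tool already holds the mapcount of every page one process has
resident. The input array and the names summarize_rss and prss_pages are
hypothetical and not taken from any posted patch. Uniquely attributable
RSS counts pages with a mapcount of 1; proportional RSS charges each page
as 1/mapcount, accumulated in fixed point so that no floating point is
needed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Fixed-point scale: each page is worth PSS_SCALE units before division. */
#define PSS_SCALE 4096u

struct rss_summary {
    uint64_t uarss_pages;   /* pages mapped by this process only */
    uint64_t prss_scaled;   /* sum of PSS_SCALE / mapcount over all pages */
};

/*
 * mapcounts[i] is the number of mappings of the i-th resident page of one
 * process (hypothetical input; how it is obtained is outside this sketch).
 * Entries of 0 are skipped as "not resident".
 */
static struct rss_summary summarize_rss(const unsigned *mapcounts, size_t n)
{
    struct rss_summary s = {0, 0};

    assert(mapcounts != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (mapcounts[i] == 0)
            continue;
        if (mapcounts[i] == 1)
            s.uarss_pages++;
        s.prss_scaled += PSS_SCALE / mapcounts[i];
    }
    return s;
}

/* Proportional RSS in whole pages, rounded down. */
static uint64_t prss_pages(const struct rss_summary *s)
{
    return s->prss_scaled / PSS_SCALE;
}
```

Integer division drops a fraction of a unit per page whenever mapcount
does not divide PSS_SCALE, so the proportional figure is a slight
underestimate; a larger scale shrinks that error.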


-- wli


Re: [PATCH 0/13] maps: pagemap, kpagemap, and related cleanups

2007-04-13 Thread Nick Piggin

Christoph Hellwig wrote:

On Fri, Apr 13, 2007 at 06:03:45PM +1000, Nick Piggin wrote:


Yeah good point ;) I just meant the wider we.

With all the problems kprobes has, something like poking deep into
kernel internals seems like a good thing to use it for instead of
hardcoding that stuff into the kernel. If not, then why did we even
merge it in the first place?



It's very nice to poke deep into the kernel for development purposes.
For example, for the spu scheduler work I'm doing currently I have
a module using kprobes (not the systemtap crap, because it's big, bloated,
in an odd language, and does a lot of really wrong things in its runtime).


OK, I pick systemtap because I don't know any better... but kprobes
is what I mean for the kernel interface.



This module allows me to put probes into various places in the scheduler
and writes them into a ringbuffer with timestamps, allowing me to
trace what's going on there.  This is really neat.  Unfortunately it
breaks as soon as I do some major reshuffling because then the points
it hooks up to are not there anymore.  That's perfectly fine for my
setup, because _I_ know what I have to change when it breaks, and can
easily fix that.  Now imagine a similar module to trace pagecache activity
used by a third-party monitoring tool.  We now get a major change to
the pagecache (say to make it lockless), and the probes just break.  In
the best case it just doesn't work anymore, in the worst case it crashes
the kernel.  Now if the app vendor at least gave me their source I
could at least fix it to not crash, but there's a fair enough chance
they poke into bits that simply aren't there anymore.

Now if we have a proper user interface with real code behind it we can
have a defined interface.  If the interface is bad enough (or just too
lowlevel) we might still have that last problem, with statistics that
were there once going away as well, but we can deal with that gracefully
by declaring parts of the stats volatile and making sure people don't
mess with them.

To summarize, I really love kprobes to ease my debugging work, but using
it for any kind of production code is a total nightmare.


OK, well Matt's stuff that he needs doesn't have to be kprobes at all,
and yes if lots of people want the same thing then we could distribute
it with the kernel.

But at least make it into its own module with a debugfs interface or
something. I mean, exposing a PG_name-to-nr mapping plus per-pfn page
count and flags as a supposedly formal proc interface doesn't sound nice
to me. Page flags do not tell you what is going on in the VM; they give
you a tiny window into something. Between reading a pfn from /proc/pid/
and looking up that pfn's page flags, the page could be reused for
something completely different anyway.



We could distribute some systemtap scripts, and even distribute some
basic useful ones like this sort of page info in the kernel source
tree.



We could not really distribute systemtap scripts with the kernel.
systemtap is a bloody complicated piece of shit outside the kernel
tree that breaks every time we change kernel internals.  We could
provide useful kprobes modules, a proper tracing system (ltt-ng-lite)
and surrounding infrastructure.


OK ;)

--
SUSE Labs, Novell Inc.


Re: [PATCH 0/13] maps: pagemap, kpagemap, and related cleanups

2007-04-13 Thread Christoph Hellwig
On Fri, Apr 13, 2007 at 06:25:46PM +1000, Nick Piggin wrote:
 But at least make it into its own module with a debugfs interface or
 something. I mean, exposing a PG_name-to-nr mapping plus per-pfn page
 count and flags as a supposedly formal proc interface doesn't sound nice
 to me. Page flags do not tell you what is going on in the VM; they give
 you a tiny window into something. Between reading a pfn from /proc/pid/
 and looking up that pfn's page flags, the page could be reused for
 something completely different anyway.

I agree that exposing numerical values of page flags is not a very good
idea at all.  If we really want to expose this information it should
at least be in string form, although that is quite a bit of a maintenance
horror as well.



Re: [PATCH 0/13] maps: pagemap, kpagemap, and related cleanups

2007-04-13 Thread Ananth N Mavinakayanahalli
On Fri, Apr 13, 2007 at 12:50:20PM +1000, Nick Piggin wrote:
 Andrew Morton wrote:
 On Fri, 13 Apr 2007 12:18:56 +1000 Nick Piggin [EMAIL PROTECTED] wrote:
 
 
 I guess one could generate an answer to the static question with
 systemtap, by accumulating running counts across the application
 lifetime and then snapshotting them.  Sounds hard though.
 
 Can't you just traverse arbitrary kernel data structures at a given point
 in time, exactly like the /proc/ call is doing?
 
 
 Do a full pagetable walk, with all the associated locking from within
 a systemtap script?  I'd be surprised.  Maybe if it's mostly hand-coded
 in C, perhaps.
 
 It looks like you can traverse arbitrary data structures, yes.
 
 It definitely seems like you can use some kernel functions, but the
 ones I saw may just be systemtap facilities. But what is so surprising
 about being able to call a kernel function when running in kernel
 context? Perhaps there is some fundamental limitation of kprobes that
 I don't understand.

The main requirement for kprobes handlers is that they can't sleep. You
could definitely call a kernel function from kprobe handlers as long as
the function doesn't sleep.

Ananth


Re: [PATCH 0/13] maps: pagemap, kpagemap, and related cleanups

2007-04-13 Thread Ananth N Mavinakayanahalli
On Fri, Apr 13, 2007 at 12:54:36PM +1000, Nick Piggin wrote:
 Matt Mackall wrote:
 On Fri, Apr 13, 2007 at 12:21:25PM +1000, Nick Piggin wrote:
 
 Matt Mackall wrote:
 
 On Fri, Apr 13, 2007 at 11:42:29AM +1000, Nick Piggin wrote:
 
 If kprobes is simply crappy and doesn't work properly for this, then I
 could accept that. I'm not someone trying to get this info. So why can't
 it be used? (not just for kpagemap, but for clear_refs and all that gunk
 too).
 
 
 kprobes is good for looking at events, but bad for looking at state.
 Especially metric shitloads of state.
 
 Why? Why is a kprobes trap significantly more expensive than a read
 syscall?
 
 
 I guess I'm not clear on what you're proposing. From my understanding
 of kprobes (admittedly not an expert), this is hard to do and not a
 very good match.
 
 But you have an idea that it is bad for exposing lots of data. Why?
 (I'm not a kprobes expert either, these are not rhetorical questions)

You could tie your kprobe module to use relay channels. Kprobe handlers
run lockless and using the per-cpu relay channels will provide a fast
transport mechanism for exposing lots of data.

http://relayfs.sourceforge.net/examples.html#tprintk_kprobes is an
example using the earlier relayfs interface. It shouldn't be that hard
to change it to use the newer relay stuff.

AFAIK acme is using a similar mechanism for ctracer
(http://oops.ghostprotocols.net:81/blog/?p=50)

Ananth


Re: [PATCH 0/13] maps: pagemap, kpagemap, and related cleanups

2007-04-13 Thread Nick Piggin

Ananth N Mavinakayanahalli wrote:

On Fri, Apr 13, 2007 at 12:50:20PM +1000, Nick Piggin wrote:



It definitely seems like you can use some kernel functions, but the
ones I saw may just be systemtap facilities. But what is so surprising
about being able to call a kernel function when running in kernel
context? Perhaps there is some fundamental limitation of kprobes that
I don't understand.



The main requirement for kprobes handlers is that they can't sleep. You
could definitely call a kernel function from kprobe handlers as long as
the function doesn't sleep.


That would be enough to access basically all the VM data structures.

--
SUSE Labs, Novell Inc.


Re: [PATCH 0/13] maps: pagemap, kpagemap, and related cleanups

2007-04-13 Thread Theodore Tso
On Fri, Apr 13, 2007 at 01:40:08PM +1000, Nick Piggin wrote:
 With systemtap scripts, you could walk pagetables and print *the exact
 page information you want*, or you could walk pfns, or LRU, or page_tree,
 or walk the page tree then the rmap structures. And you can selectively
 cull out items you don't care about if you only care about a subset of
 items, based on arbitrary criteria. And you can most likely do all that
 more efficiently than with a conglomeration of various /proc files
 (assuming they even provide what you want in the first place).

Yes, but maintaining the systemtap scripts will be a nightmare, since
they would be outside the kernel, and as we change our internal data
structures, the scripts would become useless.

This is a fundamental problem with systemtap that we haven't been able
to solve yet, because solving it would freeze various internal data
structures or kernel functions.  I agree that's not acceptable; which
is why I don't think systemtap would be a good match for the problem
we're trying to solve here.

- Ted


Re: [PATCH 0/13] maps: pagemap, kpagemap, and related cleanups

2007-04-13 Thread Matt Mackall
On Thu, Apr 12, 2007 at 05:42:01PM -0700, Andrew Morton wrote:
 On Fri, 13 Apr 2007 10:15:24 +1000 Nick Piggin [EMAIL PROTECTED] wrote:
 
  +   ((char *)page)[1] = PAGE_SHIFT;
   
   
   OK.
  
  Shouldn't we just expose page size and endianness by other means?
  (another file or syscall).
 
 I don't think so - this file exposes fairly deep kernel internals and
 that's unavoidable, really - it's *supposed* to do that.  It is explicitly
 designed for monitoring kernel behaviour.
 
 So it needs special handling by userspace.  Keeping the number of files
 which need such special handling to a minimum will keep the number of
 applications which are exposed to kernel changes to a minimum.
 
  +   for (; i < 2 * chunk / KPMSIZE; i += 2, pfn++) {
  +       ppage = pfn_to_page(pfn);
  +       if (!ppage) {
  +           page[i] = 0;
  +           page[i + 1] = 0;
  +       } else {
  +           page[i] = ppage->flags;
  +           page[i + 1] = atomic_read(&ppage->_count);
  +       }
  +   }
   
   
   Not a good idea to expose raw flags in this manner - it changes at the
   drop of a hat.  We'd need to also expose the kernel's PG_foo-to-bitnumber
   mapping to make this viable.
  
  I don't think it is viable because that makes the flags part of the
  userspace ABI.
 
 It *will* be viable.  If the application wants to know if a page is dirty,
 it looks up PG_dirty in /proc/pg_foo-to-bitnumber and uses PG_dirty's
 numerical offset when inspecting fields in /proc/kpagemap.  If correctly
 designed, such a monitoring application will be able to report upon page
 flags which we haven't even thought up yet.

We can probably fit this in the existing (variable-sized) header.
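
A minimal sketch of the lookup described in the quoted text, assuming the
name-to-bit mapping reaches userspace as text with one "NAME NUMBER" pair
per line. That format is an assumption made only for illustration; the
thread does not define one, and the names flag_bit and page_has_flag are
hypothetical. The point is that the tool resolves PG_dirty to a bit number
at run time instead of compiling one in.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Look up a flag's bit number in a text table with one "NAME NUMBER" pair
 * per line. The table format is an assumption made for this sketch.
 * Returns -1 if the name is not listed.
 */
static int flag_bit(const char *table, const char *name)
{
    char entry[64];
    int bit, used;

    assert(table != NULL && name != NULL);
    /* %n reports how many characters this entry consumed. */
    while (sscanf(table, " %63s %d%n", entry, &bit, &used) == 2) {
        if (strcmp(entry, name) == 0)
            return bit;
        table += used;
    }
    return -1;
}

/* Test a page's raw flags word against a bit number found at run time. */
static int page_has_flag(unsigned long flags, int bit)
{
    if (bit < 0 || bit >= (int)(sizeof(flags) * 8))
        return 0;   /* unknown flag: report "not set" rather than guess */
    return (int)((flags >> bit) & 1ul);
}
```

A tool written this way could also walk the whole table and report every
flag it finds by name, including ones added to the kernel after the tool
was written.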
 
  I wonder what they are needed for.
 
 Poking deeply into the kernel to provide information about kernel state. 
 
 There are real-world needs for this, and the people who develop tools to
 process this information will have decent kernel understanding and will
 know that the file's contents may alter across kernel versions.  It sure
 beats poking around in /dev/kmem.
 
 I doubt if there's a sensible way in which we can prettify this interface
 without losing information.  But we should aim to make it as robust as
 possible against future kernel changes, of course.
 
 And we should satisfy ourselves that all the required information has been
 made available.  The fact that it will satisfy the Oracle requirement is
 encouraging.
 
 Matt, these changes make the new field in /proc/pid/smaps redundant, don't
 they?

Which new field? From /proc/kpagemap + /proc/*/pagemap, you can
basically synthesize any statistic you want, including all the
existing ones. For some data, /proc/pid/smaps (or /proc/meminfo) will
be considerably more efficient.

But in general, most of the statistics in smaps are basically useless
for shared mappings, just like RSS. Problem is, we really don't know
what statistics we want yet, or even if it can be distilled down to
simple numbers anyway.

-- 
Mathematics is the supreme nostalgia of our time.


Re: [PATCH 0/13] maps: pagemap, kpagemap, and related cleanups

2007-04-13 Thread Andrew Morton
On Fri, 13 Apr 2007 11:24:36 -0500 Matt Mackall [EMAIL PROTECTED] wrote:

  It *will* be viable.  If the application wants to know if a page is dirty,
  it looks up PG_dirty in /proc/pg_foo-to-bitnumber and uses PG_dirty's
  numerical offset when inspecting fields in /proc/kpagemap.  If correctly
  designed, such a monitoring application will be able to report upon page
  flags which we haven't even thought up yet.
 
 We can probably fit this in the existing (variable-sized) header.

hm, OK..

   I wonder what they are needed for.
  
  Poking deeply into the kernel to provide information about kernel state. 
  
  There are real-world needs for this, and the people who develop tools to
  process this information will have decent kernel understanding and will
  know that the file's contents may alter across kernel versions.  It sure
  beats poking around in /dev/kmem.
  
  I doubt if there's a sensible way in which we can prettify this interface
  without losing information.  But we should aim to make it as robust as
  possible against future kernel changes, of course.
  
  And we should satisfy ourselves that all the required information has been
  made available.  The fact that it will satisfy the Oracle requirement is
  encouraging.
  
  Matt, these changes make the new field in /proc/pid/smaps redundant, don't
  they?
 
 Which new field?

Referenced:

 From /proc/kpagemap + /proc/*/pagemap, you can
 basically synthesize any statistic you want, including all the
 existing ones. For some data, /proc/pid/smaps (or /proc/meminfo) will
 be considerably more efficient.

You'd need to poke clear_refs beforehand to make the referenced bits useful.

Actually, we also need to run around the ptes and collect the pte-referenced
bits too.  I don't think your code copes with any of that?
 
 But in general, most of the statistics in smaps are basically useless
 for shared mappings, just like RSS. Problem is, we really don't know
 what statistics we want yet, or even if it can be distilled down to
 simple numbers anyway.

yup.  But that's the whole point, really: don't prejudge what info userspace
is trying to collect.


Re: [PATCH 0/13] maps: pagemap, kpagemap, and related cleanups

2007-04-13 Thread Matt Mackall
On Fri, Apr 13, 2007 at 12:18:56PM +1000, Nick Piggin wrote:
 Can't you just traverse arbitrary kernel data structures at a given point
 in time, exactly like the /proc/ call is doing?

Perhaps.

My understanding is that you hook a kprobe to an event. An event is a
particular instruction getting executed. Indeed, you can do whatever
poking around in the kernel you want at that point. And then you can
stuff that data in a buffer that eventually gets to userspace.

This is very different from a read/seek/syscall. Rather than just
asking the kernel for some data, we have to wait for the relevant
events. Now, of course, you can make an ugly hack like hooking
sys_getpid() and basically make your own system call. Hopefully no one
else will call getpid() while you're doing this, etc. Not really how
it's intended to work at all, and probably a bitch to use, but
possible. Then the question becomes: why don't we do this for
everything else in /proc?

And the answer of course is: we put stuff in /proc because it's
generally useful. Extra points if it's actually related to
'proc'esses. Being able to tell what's paged in in a given mapping is
useful. Being able to tell what's shared between two mappings is
useful. Being able to get an accurate, meaningful picture of how your
memory is being used is useful. Heck, I bet some people might find it
useful to be able to see what nodes the pages in their process are on.
All stuff you shouldn't need to be a kernel hacker to answer.

The flags part of /proc/kpagemap exposes some (very interesting!)
implementation details. The rest of it is completely generic to any
system with a VM. It's only deep kernel magic in the sense that it's
not yet exposed.
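
A minimal sketch of the "what's shared between two mappings" question,
assuming the tool has already collected the page frame numbers backing
each mapping into two arrays (for instance from /proc/*/pagemap; the
record layout of that file is not reproduced here, and the name
count_shared is hypothetical). Once the pfns are in hand the answer is a
plain set intersection.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* qsort comparator for 64-bit page frame numbers. */
static int cmp_pfn(const void *a, const void *b)
{
    uint64_t x = *(const uint64_t *)a, y = *(const uint64_t *)b;
    return (x > y) - (x < y);
}

/*
 * Count page frames present in both mappings. Sorts both arrays in place,
 * then walks them in step. Assumes each array holds distinct pfns.
 */
static size_t count_shared(uint64_t *a, size_t na, uint64_t *b, size_t nb)
{
    size_t i = 0, j = 0, shared = 0;

    assert((a != NULL || na == 0) && (b != NULL || nb == 0));
    if (na)
        qsort(a, na, sizeof(*a), cmp_pfn);
    if (nb)
        qsort(b, nb, sizeof(*b), cmp_pfn);
    while (i < na && j < nb) {
        if (a[i] < b[j]) {
            i++;
        } else if (a[i] > b[j]) {
            j++;
        } else {
            shared++;
            i++;
            j++;
        }
    }
    return shared;
}
```

Because the pages can be remapped between the two reads, the result is a
snapshot estimate rather than an exact figure.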

-- 
Mathematics is the supreme nostalgia of our time.
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 0/13] maps: pagemap, kpagemap, and related cleanups

2007-04-13 Thread Matt Mackall
On Fri, Apr 13, 2007 at 10:03:56AM -0700, Andrew Morton wrote:
 On Fri, 13 Apr 2007 11:24:36 -0500 Matt Mackall [EMAIL PROTECTED] wrote:
 
   It *will* be viable.  If the application wants to know if a page is dirty,
   it looks up PG_dirty in /proc/pg_foo-to-bitnumber and uses PG_dirty's
   numerical offset when inspecting fields in /proc/kpagemap.  If correctly
   designed, such a monitoring application will be able to report upon page
   flags which we haven't even thought up yet.
  
  We can probably fit this in the existing (variable-sized) header.
 
 hm, OK..
 
    I wonder what they are needed for.
   
   Poking deeply into the kernel to provide information about kernel state. 
   
   There are real-world needs for this, and the people who develop tools to
   process this information will have decent kernel understanding and will
   know that the file's contents may alter across kernel versions.  It sure
   beats poking around in /dev/kmem.
   
   I doubt if there's a sensible way in which we can prettify this interface
   without losing information.  But we should aim to make it as robust as
   possible against future kernel changes, of course.
   
   And we should satisfy ourselves that all the required information has been
   made available.  The fact that it will satisfy the Oracle requirement is
   encouraging.
   
   Matt, these changes make the new field in /proc/pid/smaps redundant, don't
   they?
  
  Which new field?
 
 Referenced:
 
  From /proc/kpagemap + /proc/*/pagemap, you can
  basically synthesize any statistic you want, including all the
  existing ones. For some data, /proc/pid/smaps (or /proc/meminfo) will
  be considerably more efficient.
 
 You'd need to poke clear_refs beforehand to make the referenced bits useful.
 
 Actually, we also need to run around the ptes and collect the pte-referenced
 bits too.  I don't think your code copes with any of that?

No, and it probably should. Perhaps dirty as well, though I've kind of
lost the plot on how that works lately.

> > But in general, most of the statistics in smaps are basically useless
> > for shared mappings, just like RSS. Problem is, we really don't know
> > what statistics we want yet, or even if it can be distilled down to
> > simple numbers anyway.
>
> yup.  But that's the whole point, really: don't prejudge what info userspace
> is trying to collect.

Right.

-- 
Mathematics is the supreme nostalgia of our time.


Re: [PATCH 0/13] maps: pagemap, kpagemap, and related cleanups

2007-04-13 Thread Andrew Morton
On Fri, 13 Apr 2007 12:24:51 -0500 Matt Mackall [EMAIL PROTECTED] wrote:

> > > From /proc/kpagemap + /proc/*/pagemap, you can
> > > basically synthesize any statistic you want, including all the
> > > existing ones. For some data, /proc/pid/smaps (or /proc/meminfo) will
> > > be considerably more efficient.
> >
> > You'd need to poke clear_refs beforehand to make the referenced bits useful.
> >
> > Actually, we also need to run around the ptes and collect the pte-referenced
> > bits too.  I don't think your code copes with any of that?
>
> No, and it probably should. Perhaps dirty as well, though I've kind of
> lost the plot on how that works lately.

Dirty is OK: the VM keeps pte-dirtiness and page-dirtiness in sync now.


Re: [PATCH 0/13] maps: pagemap, kpagemap, and related cleanups

2007-04-13 Thread Frank Ch. Eigler

Christoph Hellwig [EMAIL PROTECTED] writes:

> [...]
> > merge it in the first place?
>
> It's very nice to poke deep into the kernel for development purposes.
> For example for the spu scheduler work I'm doing currently I have
> a module using kprobes (not the systemtap crap because it's big, bloated,
> in an odd language, and does a lot of really wrong things in its runtime).

It may be worthwhile to remind people that it is easy to use systemtap
only to the extent of automating the placement of kprobes: just to
perform the function-name/source-file/line-number triplet to PC
mapping.  They can use embedded-C code to do all the same stuff they'd
do with kprobes.  They are not obligated to write any odd script code
for probing logic, nor indeed use any of this really wrong runtime.

> This module allows me to put probes into various places in the scheduler
> and writes them into a ringbuffer with timestamps allowing me to
> trace what's going on there.  This is really neat.  [...]

Indeed, and we too try to make this simple & fast: a couple of lines.

> [...] To summarize, I really love kprobes to ease my debugging work,
> but using it for any kind of production code is a total nightmare.

But at some point, some interface needs to be fixed for a final
user-space tool.  Whether that interface fixing is performed by kernel
developers being more reluctant to rewrite basic things, or by
providing a proc interface, or maintaining a kprobes module does not
matter.  Someone will feel constrained, and someone will be liberated.

One neat thing about our systemtap tool is that, no matter what layer
such interfaces become fixed within, it can probably interface to
them.  If there is no fixed interface, it can go down to debugging
info.  If there are tracing hooks present, it can attach.  It can make
appear as unified the disparate standardization policies of different
subsystems.


> > We could distribute some systemtap scripts, and even distribute some
> > basic useful ones like this sort of page info in the kernel source
> > tree.
>
> We could not really distribute systemtap scripts with the kernel.
> systemtap is a bloody complicated piece of [software]

I don't know if that should be treated as a compliment to our team, for
being able to work quickly on something that a full-grown kernel
developer finds bloody complicated.  Perhaps your information is
simply outdated.  Big & bloated?  We have several times asked for
specifics rather than smears - what about it?

> outside the kernel tree that breaks all the time we change kernel
> internals. [...]

That's begging the question.  If kernel folks are willing to maintain
some included systemtap-related code, then by definition it would not
break all the time.


- FChE


Re: [PATCH 0/13] maps: pagemap, kpagemap, and related cleanups

2007-04-12 Thread Nick Piggin

Andrew Morton wrote:
> On Fri, 13 Apr 2007 12:18:56 +1000 Nick Piggin <[EMAIL PROTECTED]> wrote:
>
>>> I guess one could generate an answer to the static question with systemtap,
>>> by accumulating running counts across the application lifetime and then
>>> snapshotting them.  Sounds hard though.
>>
>> Can't you just traverse arbitrary kernel data structures at a given point
>> in time, exactly like the /proc/ call is doing?
>
> Do a full pagetable walk, with all the associated locking from within
> a systemtap script?  I'd be surprised.  Maybe if it's mostly hand-coded
> in C, perhaps.  Then you just end up with the same thing, don't you?


And my problem isn't with the hardcoded pagetable walker. Yeah, we'd
probably still keep the pagetable callback walker thingy with Matt's
associated cleanups (and my subsequent ones to clean it up more and
move it to mm/): there are other in-kernel users for that anyway.

The point is the proc API, and exposing random little parts of deep
kernel internals that some people happen to find useful at the time.
(which is why we have an incredible proliferation of these things).

With systemtap scripts, you could walk pagetables and print *the exact
page information you want*, or you could walk pfns, or LRU, or page_tree,
or walk the page tree then the rmap structures. And you can selectively
cull out items you don't care about if you only care about a subset of
items, based on arbitrary criteria. And you can most likely do all that
more efficiently than with a conglomeration of various /proc files
(assuming they even provide what you want in the first place).

--
SUSE Labs, Novell Inc.


Re: [PATCH 0/13] maps: pagemap, kpagemap, and related cleanups

2007-04-12 Thread Nick Piggin

Nick Piggin wrote:
> Andrew Morton wrote:
>
>> Then you just end up with the same thing, don't you?
>
> Well _you_ do, because that happens to be exactly what you want. Bill
> ends up with something that displays page_mapcount instead. And I
> end up with something that traverses LRU lists rather than pfns. And
> none of it goes in /proc/ or linux-2.6/.


Oh, and you get to change it without recompiling and rebooting your
kernel.

--
SUSE Labs, Novell Inc.


Re: [PATCH 0/13] maps: pagemap, kpagemap, and related cleanups

2007-04-12 Thread Nick Piggin

Matt Mackall wrote:
> On Fri, Apr 13, 2007 at 12:21:25PM +1000, Nick Piggin wrote:
>
>> Matt Mackall wrote:
>>
>>> On Fri, Apr 13, 2007 at 11:42:29AM +1000, Nick Piggin wrote:
>>>
>>>> If kprobes is simply crappy and doesn't work properly for this, then I
>>>> could accept that. I'm not someone trying to get this info. So why can't
>>>> it be used? (not just for kpagemap, but for clear_refs and all that gunk
>>>> too).
>>>
>>> kprobes is good for looking at events, but bad for looking at state.
>>> Especially metric shitloads of state.
>>
>> Why? Why is a kprobes trap significantly more expensive than a read
>> syscall?
>
> I guess I'm not clear on what you're proposing. From my understanding
> of kprobes (admittedly not an expert), this is hard to do and not a
> very good match.

But you have an idea that it is bad for exposing lots of data. Why?
(I'm not a kprobes expert either, these are not rhetorical questions)

From what it looks like, you can traverse data structures and copy data
back to userspace. Which is what makes me think it might be suitable
(or could be made suitable).

>>>> Maybe. How about LRU? Reclaim performance is bad, and you want to work out
>>>> which pages keep going off the end of it, or which pages keep getting
>>>> written out via it, or whose pages are on the active list, forcing mine
>>>> out.
>>>
>>> Those are actually probably a good match for systemtap as they're all
>>> events.
>>
>> Traverse the LRU? Which files do they belong to? What process maps them?
>
> -ENOPARSE.


Basically, any "stuff" other than what you're exposing.

--
SUSE Labs, Novell Inc.


Re: [PATCH 0/13] maps: pagemap, kpagemap, and related cleanups

2007-04-12 Thread Nick Piggin

Andrew Morton wrote:
> On Fri, 13 Apr 2007 12:18:56 +1000 Nick Piggin <[EMAIL PROTECTED]> wrote:
>
>>> I guess one could generate an answer to the static question with systemtap,
>>> by accumulating running counts across the application lifetime and then
>>> snapshotting them.  Sounds hard though.
>>
>> Can't you just traverse arbitrary kernel data structures at a given point
>> in time, exactly like the /proc/ call is doing?
>
> Do a full pagetable walk, with all the associated locking from within
> a systemtap script?  I'd be surprised.  Maybe if it's mostly hand-coded
> in C, perhaps.


It looks like you can traverse arbitrary data structures, yes.

It definitely seems like you can use some kernel functions, but the
ones I saw may just be systemtap facilities. But what is so surprising
about being able to call a kernel function when running in kernel
context? Perhaps there is some fundamental limitation of kprobes that
I don't understand.


> Then you just end up with the same thing, don't you?


Well _you_ do, because that happens to be exactly what you want. Bill
ends up with something that displays page_mapcount instead. And I
end up with something that traverses LRU lists rather than pfns. And
none of it goes in /proc/ or linux-2.6/.

So it isn't really the same thing at all.

--
SUSE Labs, Novell Inc.


Re: [PATCH 0/13] maps: pagemap, kpagemap, and related cleanups

2007-04-12 Thread Matt Mackall
On Fri, Apr 13, 2007 at 12:21:25PM +1000, Nick Piggin wrote:
> Matt Mackall wrote:
> >On Fri, Apr 13, 2007 at 11:42:29AM +1000, Nick Piggin wrote:
> 
> >>If kprobes is simply crappy and doesn't work properly for this, then I
> >>could accept that. I'm not someone trying to get this info. So why can't
> >>it be used? (not just for kpagemap, but for clear_refs and all that gunk
> >>too).
> >
> >
> >kprobes is good for looking at events, but bad for looking at state.
> >Especially metric shitloads of state.
> 
> Why? Why is a kprobes trap significantly more expensive than a read
> syscall?

I guess I'm not clear on what you're proposing. From my understanding
of kprobes (admittedly not an expert), this is hard to do and not a
very good match.
 
> >>Maybe. How about LRU? Reclaim performance is bad, and you want to work out
> >>which pages keep going off the end of it, or which pages keep getting
> >>written out via it, or whose pages are on the active list, forcing mine
> >>out.
> >
> >
> >Those are actually probably a good match for systemtap as they're all 
> >events.
> 
> Traverse the LRU? Which files do they belong to? What process maps them?

-ENOPARSE.

-- 
Mathematics is the supreme nostalgia of our time.


Re: [PATCH 0/13] maps: pagemap, kpagemap, and related cleanups

2007-04-12 Thread Andrew Morton
On Fri, 13 Apr 2007 12:18:56 +1000 Nick Piggin <[EMAIL PROTECTED]> wrote:

> > I guess one could generate an answer to the static question with systemtap,
> > by accumulating running counts across the application lifetime and then
> > snapshotting them.  Sounds hard though.
> 
> Can't you just traverse arbitrary kernel data structures at a given point
> in time, exactly like the /proc/ call is doing?

Do a full pagetable walk, with all the associated locking from within
a systemtap script?  I'd be surprised.  Maybe if it's mostly hand-coded
in C, perhaps.  Then you just end up with the same thing, don't you?


Re: [PATCH 0/13] maps: pagemap, kpagemap, and related cleanups

2007-04-12 Thread Nick Piggin

Matt Mackall wrote:
> On Thu, Apr 12, 2007 at 06:57:23PM -0700, Andrew Morton wrote:
>
>> I guess one could generate an answer to the static question with systemtap,
>> by accumulating running counts across the application lifetime and then
>> snapshotting them.  Sounds hard though.
>
> You'd have to do it from boot onward to get a complete system image.
> One way to look at it is that systemtap can give you the derivative of
> the information, and you have to integrate it.


So everyone keeps saying.

Would you tell me why you can't just traverse the data structures
in the same way as your proc handler? From the systemtap example
scripts it seems like you can traverse arbitrary kernel data
structures.

--
SUSE Labs, Novell Inc.


Re: [PATCH 0/13] maps: pagemap, kpagemap, and related cleanups

2007-04-12 Thread Nick Piggin

Matt Mackall wrote:
> On Fri, Apr 13, 2007 at 11:42:29AM +1000, Nick Piggin wrote:
>
>> If kprobes is simply crappy and doesn't work properly for this, then I
>> could accept that. I'm not someone trying to get this info. So why can't
>> it be used? (not just for kpagemap, but for clear_refs and all that gunk
>> too).
>
> kprobes is good for looking at events, but bad for looking at state.
> Especially metric shitloads of state.

Why? Why is a kprobes trap significantly more expensive than a read
syscall?

>> Maybe. How about LRU? Reclaim performance is bad, and you want to work out
>> which pages keep going off the end of it, or which pages keep getting
>> written out via it, or whose pages are on the active list, forcing mine
>> out.
>
> Those are actually probably a good match for systemtap as they're all events.


Traverse the LRU? Which files do they belong to? What process maps them?

--
SUSE Labs, Novell Inc.


Re: [PATCH 0/13] maps: pagemap, kpagemap, and related cleanups

2007-04-12 Thread Nick Piggin

Andrew Morton wrote:
> On Fri, 13 Apr 2007 11:42:29 +1000 Nick Piggin <[EMAIL PROTECTED]> wrote:
>
>> Maybe. How about LRU? Reclaim performance is bad, and you want to work out
>> which pages keep going off the end of it, or which pages keep getting
>> written out via it, or whose pages are on the active list, forcing mine
>> out.
>
> I guess we have static analysis versus dynamic.  The interfaces which Matt
> is proposing are suited to answering the question "what is my memory being
> used for" (static).  They're unlikely to be useful for answering the question
> "what's happening in the VM" (dynamic).  Systemtap is probably better for the
> dynamic analysis.

"what is my memory being used for *now*" ;)

> I guess one could generate an answer to the static question with systemtap,
> by accumulating running counts across the application lifetime and then
> snapshotting them.  Sounds hard though.


Can't you just traverse arbitrary kernel data structures at a given point
in time, exactly like the /proc/ call is doing?

--
SUSE Labs, Novell Inc.


Re: [PATCH 0/13] maps: pagemap, kpagemap, and related cleanups

2007-04-12 Thread Matt Mackall
On Thu, Apr 12, 2007 at 06:57:23PM -0700, Andrew Morton wrote:
> I guess one could generate an answer to the static question with systemtap,
> by accumulating running counts across the application lifetime and then
> snapshotting them.  Sounds hard though.

You'd have to do it from boot onward to get a complete system image.
One way to look at it is that systemtap can give you the derivative of
the information, and you have to integrate it.
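
To make the derivative/integral point concrete, here is a small Python sketch. The event tuples and the integrate() helper are invented for illustration and are not systemtap output: replaying alloc/free events only reproduces the true state if the replay starts from boot, or from a known snapshot.

```python
def integrate(events, initial=None):
    """Replay (op, pfn) events on top of an initial set of allocated pages."""
    state = set(initial or ())
    for op, pfn in events:
        if op == "alloc":
            state.add(pfn)
        elif op == "free":
            state.discard(pfn)
    return state

# Hypothetical event stream, in order of occurrence.
events = [("alloc", 1), ("alloc", 2), ("free", 1), ("alloc", 3)]

# Tracing from "boot": every event is seen, so the state is complete.
full = integrate(events)

# Tracing attached after the first two events: page 2 is never observed.
late = integrate(events[2:])

# Starting from a snapshot taken at attach time recovers the full picture.
seeded = integrate(events[2:], initial={1, 2})
```

A snapshot interface supplies the initial state directly, which is the piece event tracing alone cannot provide without running from boot.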

-- 
Mathematics is the supreme nostalgia of our time.


Re: [PATCH 0/13] maps: pagemap, kpagemap, and related cleanups

2007-04-12 Thread Nick Piggin

Matt Mackall wrote:
> On Fri, Apr 13, 2007 at 11:01:41AM +1000, Nick Piggin wrote:
>
>>> Basically: to show what the hell's going on in the VM.
>>
>> kprobes / systemtap isn't good enough?
>
> It's not really a good match to the kprobes model. I'm not interested
> in events, per se. I don't want to need to know about every single
> alloc/free of N different varieties integrated from boot onward to
> build up an image of the state of the system. Instead, I want to take
> snapshots of the state of the VM.

Systemtap can't output a large set of values?

Why can't you attach a kprobe to a dummy syscall, and from there
iterate over pgdat/zone/memmap and output what you want?

Actually I'm surprised that kind of data querying facility isn't
already in there (I haven't used it seriously though).

> The main goal here is to be able to answer the question "where's my
> memory going?". Currently you can't really give a good answer to that
> question from userspace because of shared mappings, etc.
>
> There are lots of secondary questions that follow on very quickly from
> that, like "what parts of my shared mappings are or aren't shared, and
> why?", "what's actually in my application's working set?" and "how much
> of this crap can I ditch?".


I understand roughly what you want, and that you can't easily get
it from /proc currently. My question at this point is just why can
we not use systemtap.

--
SUSE Labs, Novell Inc.


Re: [PATCH 0/13] maps: pagemap, kpagemap, and related cleanups

2007-04-12 Thread Matt Mackall
On Fri, Apr 13, 2007 at 11:42:29AM +1000, Nick Piggin wrote:
> >Instead, one says "what pages are being used by my application", then, for
> 
> That includes unmapped pagecache being used by my application, doesn't it?
> Maybe that's too hard to do via /proc so we forget about it...

It'd be really nice to have a window into the pagecache too. But I for
one couldn't come up with a sensible scheme for it.

> >each of those pages "what is that page's state".  So the first step is to
> >collect all the pfns from /proc/$(pidof my-application)/pagemap and then to
> >use those pfns to look the individual pages up in /proc/kpagemap.
> 
> OK I realise you could do it that way, but systemtap can definitely be
> used as a tool for understanding application behaviour in the context of
> the kernel, I think? The purpose for it is so that various little bits
> of deep kernel internals do not have to be exposed on a case by case basis.
> 
> If kprobes is simply crappy and doesn't work properly for this, then I
> could accept that. I'm not someone trying to get this info. So why can't
> it be used? (not just for kpagemap, but for clear_refs and all that gunk
> too).

kprobes is good for looking at events, but bad for looking at state.
Especially metric shitloads of state.

> > If you really want to know "who is using page 123435" then you'd need to
> > search /proc/*/pagemap.  There are possibly legitimate reasons why an
> > application developer would want to at least partially perform such an
> > operation ("who am I sharing with"), but I doubt if it's the common case.
> 
> Maybe. How about LRU? Reclaim performance is bad, and you want to work out
> which pages keep going off the end of it, or which pages keep getting
> written out via it, or whose pages are on the active list, forcing mine
> out.

Those are actually probably a good match for systemtap as they're all events.

-- 
Mathematics is the supreme nostalgia of our time.


Re: [PATCH 0/13] maps: pagemap, kpagemap, and related cleanups

2007-04-12 Thread Andrew Morton
On Fri, 13 Apr 2007 11:42:29 +1000 Nick Piggin <[EMAIL PROTECTED]> wrote:

> Andrew Morton wrote:
> > On Fri, 13 Apr 2007 11:14:20 +1000 Nick Piggin <[EMAIL PROTECTED]> wrote:
> > 
> > 
> >>Andrew Morton wrote:
> 
> >>>It *will* be viable.  If the application wants to know if a page is dirty,
> >>>it looks up "PG_dirty" in /proc/pg_foo-to-bitnumber and uses PG_dirty's
> >>>numerical offset when inspecting fields in /proc/kpagemap.  If correctly
> >>>designed, such a monitoring application will be able to report upon page
> >>>flags which we haven't even thought up yet.
> >>
> >>Ooh, you wanted a _runtime_ mapping of flags, yeah then I guess that works.
> >>Still seems like a basically hit and miss affair to just use flags. What if
> >>you want to know the process mapping a page? With systemtap or something you
> >>could walk the rmap structures. What if you want to look at pages along the
> >>LRU list rather than per-pfn? What about connecting pages to inodes?
> > 
> > 
> > Well hang on.  This isn't a tool for understanding kernel behaviour.  It's
> > a tool for understanding application behaviour.
> > 
> > So one doesn't ask "who is mapping that page" - that's a kernel developer
> > thing.
> > 
> > Instead, one says "what pages are being used by my application", then, for
> 
> That includes unmapped pagecache being used by my application, doesn't it?
> Maybe that's too hard to do via /proc so we forget about it...

Yes, harder.  I'm hoping that sampling of /proc/pid/io can be used to
determine pagecache use sufficiently accurately.  I know of one large
hosting company who are using it ("BTW, we are making great use of
taskstats!!  Its GREAT!")

> 
> > each of those pages "what is that page's state".  So the first step is to
> > collect all the pfns from /proc/$(pidof my-application)/pagemap and then to
> > use those pfns to look the individual pages up in /proc/kpagemap.
> 
> OK I realise you could do it that way, but systemtap can definitely be
> used as a tool for understanding application behaviour in the context of
> the kernel, I think? The purpose for it is so that various little bits
> of deep kernel internals do not have to be exposed on a case by case basis.
> 
> If kprobes is simply crappy and doesn't work properly for this, then I
> could accept that. I'm not someone trying to get this info. So why can't
> it be used? (not just for kpagemap, but for clear_refs and all that gunk
> too).
> 
>  > If you really want to know "who is using page 123435" then you'd need to
>  > search /proc/*/pagemap.  There are possibly legitimate reasons why an
>  > application developer would want to at least partially perform such an
>  > operation ("who am I sharing with"), but I doubt if it's the common case.
> 
> Maybe. How about LRU? Reclaim performance is bad, and you want to work out
> which pages keep going off the end of it, or which pages keep getting
> written out via it, or whose pages are on the active list, forcing mine
> out.

I guess we have static analysis versus dynamic.  The interfaces which Matt
is proposing are suited to answering the question "what is my memory being
used for" (static).  They're unlikely to be useful for answering the question
"what's happening in the VM" (dynamic).  Systemtap is probably better for the
dynamic analysis.

I guess one could generate an answer to the static question with systemtap,
by accumulating running counts across the application lifetime and then
snapshotting them.  Sounds hard though.



Re: [PATCH 0/13] maps: pagemap, kpagemap, and related cleanups

2007-04-12 Thread Matt Mackall
On Fri, Apr 13, 2007 at 11:01:41AM +1000, Nick Piggin wrote:
> >Basically: to show what the hell's going on in the VM.
> 
> kprobes / systemtap isn't good enough?

It's not really a good match to the kprobes model. I'm not interested
in events, per se. I don't want to need to know about every single
alloc/free of N different varieties integrated from boot onward to
build up an image of the state of the system. Instead, I want to take
snapshots of the state of the VM.

The main goal here is to be able to answer the question "where's my
memory going?". Currently you can't really give a good answer to that
question from userspace because of shared mappings, etc.

There are lots of secondary questions that follow on very quickly from
that, like "what parts of my shared mappings are or aren't shared, and
why?", "what's actually in my application's working set?" and "how much
of this crap can I ditch?".

-- 
Mathematics is the supreme nostalgia of our time.


Re: [PATCH 0/13] maps: pagemap, kpagemap, and related cleanups

2007-04-12 Thread Nick Piggin

Andrew Morton wrote:
> On Fri, 13 Apr 2007 11:14:20 +1000 Nick Piggin <[EMAIL PROTECTED]> wrote:
>
>> Andrew Morton wrote:
>>
>>> It *will* be viable.  If the application wants to know if a page is dirty,
>>> it looks up "PG_dirty" in /proc/pg_foo-to-bitnumber and uses PG_dirty's
>>> numerical offset when inspecting fields in /proc/kpagemap.  If correctly
>>> designed, such a monitoring application will be able to report upon page
>>> flags which we haven't even thought up yet.
>>
>> Ooh, you wanted a _runtime_ mapping of flags, yeah then I guess that works.
>> Still seems like a basically hit and miss affair to just use flags. What if
>> you want to know the process mapping a page? With systemtap or something you
>> could walk the rmap structures. What if you want to look at pages along the
>> LRU list rather than per-pfn? What about connecting pages to inodes?
>
> Well hang on.  This isn't a tool for understanding kernel behaviour.  It's
> a tool for understanding application behaviour.
>
> So one doesn't ask "who is mapping that page" - that's a kernel developer
> thing.
>
> Instead, one says "what pages are being used by my application", then, for

That includes unmapped pagecache being used by my application, doesn't it?
Maybe that's too hard to do via /proc so we forget about it...

> each of those pages "what is that page's state".  So the first step is to
> collect all the pfns from /proc/$(pidof my-application)/pagemap and then to
> use those pfns to look the individual pages up in /proc/kpagemap.


OK I realise you could do it that way, but systemtap can definitely be
used as a tool for understanding application behaviour in the context of
the kernel, I think? The purpose for it is so that various little bits
of deep kernel internals do not have to be exposed on a case by case basis.

If kprobes is simply crappy and doesn't work properly for this, then I
could accept that. I'm not someone trying to get this info. So why can't
it be used? (not just for kpagemap, but for clear_refs and all that gunk
too).

> If you really want to know "who is using page 123435" then you'd need to
> search /proc/*/pagemap.  There are possibly legitimate reasons why an
> application developer would want to at least partially perform such an
> operation ("who am I sharing with"), but I doubt if it's the common case.

Maybe. How about LRU? Reclaim performance is bad, and you want to work out
which pages keep going off the end of it, or which pages keep getting
written out via it, or whose pages are on the active list, forcing mine
out.

--
SUSE Labs, Novell Inc.


Re: [PATCH 0/13] maps: pagemap, kpagemap, and related cleanups

2007-04-12 Thread Andrew Morton
On Fri, 13 Apr 2007 11:14:20 +1000 Nick Piggin <[EMAIL PROTECTED]> wrote:

> Andrew Morton wrote:
> > On Fri, 13 Apr 2007 10:15:24 +1000 Nick Piggin <[EMAIL PROTECTED]> wrote:
> 
> +	for (; i < 2 * chunk / KPMSIZE; i += 2, pfn++) {
> +		ppage = pfn_to_page(pfn);
> +		if (!ppage) {
> +			page[i] = 0;
> +			page[i + 1] = 0;
> +		} else {
> +			page[i] = ppage->flags;
> +			page[i + 1] = atomic_read(&ppage->_count);
> +		}
> +	}
> >>>
> >>>
> >>>Not a good idea to expose raw flags in this manner - it changes at the drop
> >>>of a hat.  We'd need to also expose the kernel's PG_foo-to-bitnumber
> >>>mapping to make this viable.
> >>
> >>I don't think it is viable because that makes the flags part of the
> >>userspace ABI.
> > 
> > 
> > It *will* be viable.  If the application wants to know if a page is dirty,
> > it looks up "PG_dirty" in /proc/pg_foo-to-bitnumber and uses PG_dirty's
> > numerical offset when inspecting fields in /proc/kpagemap.  If correctly
> > designed, such a monitoring application will be able to report upon page
> > flags which we haven't even thought up yet.
> 
> Ooh, you wanted a _runtime_ mapping of flags, yeah then I guess that works.
> Still seems like a basically hit and miss affair to just use flags. What if
> you want to know the process mapping a page? With systemtap or something you
> could walk the rmap structures. What if you want to look at pages along the
> LRU list rather than per-pfn? What about connecting pages to inodes?

Well hang on.  This isn't a tool for understanding kernel behaviour.  It's
a tool for understanding application behaviour.

So one doesn't ask "who is mapping that page" - that's a kernel developer
thing.

Instead, one says "what pages are being used by my application", then, for
each of those pages "what is that page's state".  So the first step is to
collect all the pfns from /proc/$(pidof my-application)/pagemap and then to
use those pfns to look the individual pages up in /proc/kpagemap.
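
A minimal sketch of the second step, assuming the layout from the quoted
patch: a header whose length is given by its own fourth byte, followed by
one (flags, count) pair per pfn, each word being "entry size" bytes wide.
The name kpagemap_offset is hypothetical and does not come from the patch.

```c
#include <stdint.h>

/*
 * Byte offset of the (flags, count) pair for `pfn` in /proc/kpagemap.
 * Assumes header_size bytes of header, then two entry_size-byte words
 * per page frame, as in the patch under review.
 */
uint64_t kpagemap_offset(uint64_t pfn, unsigned entry_size, unsigned header_size)
{
    return (uint64_t)header_size + pfn * 2u * entry_size;
}
```

A monitoring tool would seek to this offset (for example with pread) for
each pfn it collected from the per-process pagemap file.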

If you really want to know "who is using page 123435" then you'd need to
search /proc/*/pagemap.  There are possibly legitimate reasons why an
application developer would want to at least partially perform such an
operation ("who am I sharing with"), but I doubt if it's the common case.

> 
> But I was going to say
> that satisfying an Oracle requirement is a good reason _not_ to merge it ;)
>

hm, yes, there's plenty of precedent for that.

> (I joke!)

I akpm!


Re: [PATCH 0/13] maps: pagemap, kpagemap, and related cleanups

2007-04-12 Thread Nick Piggin

Andrew Morton wrote:
> On Fri, 13 Apr 2007 10:15:24 +1000 Nick Piggin <[EMAIL PROTECTED]> wrote:
>
>>>>+   for (; i < 2 * chunk / KPMSIZE; i += 2, pfn++) {
>>>>+       ppage = pfn_to_page(pfn);
>>>>+       if (!ppage) {
>>>>+           page[i] = 0;
>>>>+           page[i + 1] = 0;
>>>>+       } else {
>>>>+           page[i] = ppage->flags;
>>>>+           page[i + 1] = atomic_read(&ppage->_count);
>>>>+       }
>>>>+   }
>>>
>>>Not a good idea to expose raw flags in this manner - it changes at the drop
>>>of a hat.  We'd need to also expose the kernel's PG_foo-to-bitnumber
>>>mapping to make this viable.
>>
>>I don't think it is viable because that makes the flags part of the
>>userspace ABI.
>
> It *will* be viable.  If the application wants to know if a page is dirty,
> it looks up "PG_dirty" in /proc/pg_foo-to-bitnumber and uses PG_dirty's
> numerical offset when inspecting fields in /proc/kpagemap.  If correctly
> designed, such a monitoring application will be able to report upon page
> flags which we haven't even thought up yet.

Ooh, you wanted a _runtime_ mapping of flags, yeah then I guess that works.
Still seems like a basically hit and miss affair to just use flags. What if
you want to know the process mapping a page? With systemtap or something you
could walk the rmap structures. What if you want to look at pages along the
LRU list rather than per-pfn? What about connecting pages to inodes?

I thought this type of deep poking was the whole reason the probes thingies
were merged. I'm saddened that they're no good for this. I thought it would
be an ideal usage :(

>>I wonder what they are needed for.
>
> Poking deeply into the kernel to provide information about kernel state.
>
> There are real-world needs for this, and the people who develop tools to
> process this information will have decent kernel understanding and will
> know that the file's contents may alter across kernel versions.  It sure
> beats poking around in /dev/kmem.
>
> I doubt if there's a sensible way in which we can prettify this interface
> without losing information.  But we should aim to make it as robust as
> possible against future kernel changes, of course.
>
> And we should satisfy ourselves that all the required information has been
> made available.  The fact that it will satisfy the Oracle requirement is
> encouraging.

Yeah it is close, they need page_mapcount I think. But I was going to say
that satisfying an Oracle requirement is a good reason _not_ to merge it ;)
(I joke!)

--
SUSE Labs, Novell Inc.


Re: [PATCH 0/13] maps: pagemap, kpagemap, and related cleanups

2007-04-12 Thread Nick Piggin

Matt Mackall wrote:
> On Fri, Apr 13, 2007 at 10:15:24AM +1000, Nick Piggin wrote:
>
>>Andrew Morton wrote:
>>
>>>On Thu, 12 Apr 2007 16:10:50 -0700
>>>William Lee Irwin III <[EMAIL PROTECTED]> wrote:
>>
>>>>+   while (count > 0) {
>>>>+       chunk = min_t(size_t, count, PAGE_SIZE);
>>>>+       i = 0;
>>>>+
>>>>+       if (pfn == -1) {
>>>>+           page[0] = 0;
>>>>+           page[1] = 0;
>>>>+           ((char *)page)[0] = (ntohl(1) != 1);
>>>
>>>OK.
>>>
>>>>+           ((char *)page)[1] = PAGE_SHIFT;
>>>
>>>OK.
>>
>>Shouldn't we just expose page size and endianness by other means? (another
>>file or syscall).
>
> If I send you this file dumped from a random machine, you won't know
> what to make of it.

That's a good reason ;)

> I'm planning to write a trivial server to sit on, say, my embedded
> target and spew this over the wire to a client.
>
>>>Not a good idea to expose raw flags in this manner - it changes at the drop
>>>of a hat.  We'd need to also expose the kernel's PG_foo-to-bitnumber
>>>mapping to make this viable.
>>
>>I don't think it is viable because that makes the flags part of the
>>userspace ABI. I wonder what they are needed for.
>
> Basically: to show what the hell's going on in the VM.

kprobes / systemtap isn't good enough?

--
SUSE Labs, Novell Inc.


Re: [PATCH 0/13] maps: pagemap, kpagemap, and related cleanups

2007-04-12 Thread Andrew Morton
On Fri, 13 Apr 2007 10:15:24 +1000 Nick Piggin <[EMAIL PROTECTED]> wrote:

> >>+   ((char *)page)[1] = PAGE_SHIFT;
> > 
> > 
> > OK.
> 
> Shouldn't we just expose page size and endianness by other means? (another
> file or syscall).

I don't think so - this file exposes fairly deep kernel internals and
that's unavoidable, really - it's *supposed* to do that.  It is explicitly
designed for monitoring kernel behaviour.

So it needs special handling by userspace.  Keeping the number of files
which need such special handling to a minimum will keep the number of
applications which are exposed to kernel changes to a minimum.

> >>+   for (; i < 2 * chunk / KPMSIZE; i += 2, pfn++) {
> >>+       ppage = pfn_to_page(pfn);
> >>+       if (!ppage) {
> >>+           page[i] = 0;
> >>+           page[i + 1] = 0;
> >>+       } else {
> >>+           page[i] = ppage->flags;
> >>+           page[i + 1] = atomic_read(&ppage->_count);
> >>+       }
> >>+   }
> > 
> > 
> > Not a good idea to expose raw flags in this manner - it changes at the drop
> > of a hat.  We'd need to also expose the kernel's PG_foo-to-bitnumber
> > mapping to make this viable.
> 
> I don't think it is viable because that makes the flags part of the
> userspace ABI.

It *will* be viable.  If the application wants to know if a page is dirty,
it looks up "PG_dirty" in /proc/pg_foo-to-bitnumber and uses PG_dirty's
numerical offset when inspecting fields in /proc/kpagemap.  If correctly
designed, such a monitoring application will be able to report upon page
flags which we haven't even thought up yet.
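
A minimal sketch of that runtime lookup. The thread does not specify the
format of /proc/pg_foo-to-bitnumber, so this assumes one "NAME BITNUMBER"
pair per line; the function names and the bit numbers used when trying it
out are illustrative only, not real kernel values.

```c
#include <limits.h>
#include <stdio.h>
#include <string.h>

/*
 * Look up `name` in a table of "NAME BITNUMBER" lines (assumed format).
 * Returns the bit number, or -1 if the name is not listed.
 */
int flag_bit(const char *table, const char *name)
{
    const char *line = table;

    while (*line) {
        char key[64];
        int bit;

        if (sscanf(line, "%63s %d", key, &bit) == 2 && strcmp(key, name) == 0)
            return bit;

        line = strchr(line, '\n');   /* advance to the next line, if any */
        if (!line)
            break;
        line++;
    }
    return -1;
}

/* Test one bit of a raw flags word as read from /proc/kpagemap. */
int page_has_flag(unsigned long flags, int bit)
{
    if (bit < 0 || bit >= (int)(sizeof(flags) * CHAR_BIT))
        return 0;
    return (int)((flags >> bit) & 1UL);
}
```

Because the tool only ever works with names it read from the table, it can
print flags it has never heard of, which is the property being argued for.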

> I wonder what they are needed for.

Poking deeply into the kernel to provide information about kernel state. 

There are real-world needs for this, and the people who develop tools to
process this information will have decent kernel understanding and will
know that the file's contents may alter across kernel versions.  It sure
beats poking around in /dev/kmem.

I doubt if there's a sensible way in which we can prettify this interface
without losing information.  But we should aim to make it as robust as
possible against future kernel changes, of course.

And we should satisfy ourselves that all the required information has been
made available.  The fact that it will satisfy the Oracle requirement is
encouraging.

Matt, these changes make the new field in /proc/pid/smaps redundant, don't
they?



Re: [PATCH 0/13] maps: pagemap, kpagemap, and related cleanups

2007-04-12 Thread Matt Mackall
On Fri, Apr 13, 2007 at 10:15:24AM +1000, Nick Piggin wrote:
> Andrew Morton wrote:
> >On Thu, 12 Apr 2007 16:10:50 -0700
> >William Lee Irwin III <[EMAIL PROTECTED]> wrote:
> 
> >>+   while (count > 0) {
> >>+       chunk = min_t(size_t, count, PAGE_SIZE);
> >>+       i = 0;
> >>+
> >>+       if (pfn == -1) {
> >>+           page[0] = 0;
> >>+           page[1] = 0;
> >>+           ((char *)page)[0] = (ntohl(1) != 1);
> >
> >
> >OK.
> >
> >
> >>+           ((char *)page)[1] = PAGE_SHIFT;
> >
> >
> >OK.
> 
> Shouldn't we just expose page size and endianness by other means? (another
> file or syscall).

If I send you this file dumped from a random machine, you won't know
what to make of it.

I'm planning to write a trivial server to sit on, say, my embedded
target and spew this over the wire to a client. 
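
A rough sketch of what the client side of that could look like, assuming
the four-byte header from the patch (endianness, page shift, entry size,
header size). The struct and function names are hypothetical; the point is
only that the dump can be decoded without knowing anything about the
machine it came from.

```c
#include <stdint.h>

struct kpm_header {
    int little_endian;      /* first byte: 0 = big endian, 1 = little */
    unsigned page_shift;    /* second byte, e.g. 12 for 4096-byte pages */
    unsigned entry_size;    /* third byte: 4 or 8 */
    unsigned header_size;   /* fourth byte */
};

/* Decode the 4-byte header. Returns 0 on success, -1 if it looks wrong. */
int kpm_parse_header(const unsigned char buf[4], struct kpm_header *h)
{
    if (buf[0] > 1 || (buf[2] != 4 && buf[2] != 8))
        return -1;

    h->little_endian = buf[0];
    h->page_shift = buf[1];
    h->entry_size = buf[2];
    h->header_size = buf[3];
    return 0;
}

/* Decode one word of `size` bytes in the dump's byte order, not the host's. */
uint64_t kpm_read_word(const unsigned char *p, unsigned size, int little_endian)
{
    uint64_t v = 0;
    unsigned i;

    for (i = 0; i < size; i++) {
        /* walk from the most significant byte to the least */
        unsigned idx = little_endian ? size - 1 - i : i;
        v = (v << 8) | p[idx];
    }
    return v;
}
```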

> >Not a good idea to expose raw flags in this manner - it changes at the drop
> >of a hat.  We'd need to also expose the kernel's PG_foo-to-bitnumber
> >mapping to make this viable.
> 
> I don't think it is viable because that makes the flags part of the
> userspace ABI. I wonder what they are needed for.

Basically: to show what the hell's going on in the VM.

-- 
Mathematics is the supreme nostalgia of our time.


Re: [PATCH 0/13] maps: pagemap, kpagemap, and related cleanups

2007-04-12 Thread Matt Mackall
On Thu, Apr 12, 2007 at 04:32:35PM -0700, Andrew Morton wrote:
> On Thu, 12 Apr 2007 16:10:50 -0700
> William Lee Irwin III <[EMAIL PROTECTED]> wrote:
> 
> > On Tue, Apr 03, 2007 at 09:43:30PM -0500, Matt Mackall wrote:
> > > This patch series introduces /proc/pid/pagemap and /proc/kpagemap,
> > > which allow detailed run-time examination of process memory usage at a
> > > page granularity.
> > > The first several patches whip the page-walking code introduced for
> > > /proc/pid/smaps and clear_refs into a more generic form, the next
> > > couple make those interfaces optional, and the last two introduce the
> > > new interfaces, also optional.
> > 
> > This solves a real-life problem for Oracle system monitoring software
> > (specifically EM). Among the tasks it must carry out is determining
> > per-process memory footprint of a set of cooperating tasks (i.e. Oracle
> > processes). RSS is inadequate for this due to page sharing; this work
> > provides sufficient information to determine what EM needs.
> 
> I'm still dying to see what the human-readable output from this
> thing looks like.

Still a work-in-progress. It's a monstrous amount of data and it
basically requires a GUI to really get a handle on. Here's a couple
apps I've been tinkering with (aka My First GTK Apps):

http://selenic.com/Screenshot-pagemap.png

That's a snapshot of a live-updating image of memory usage for a
running process (Galeon). Each pixel is a page. Each 32x32 block is
4MB. Mappings are dark red. Pages that are actually faulted in are
bright red. You can poke around in the memory map with the mouse and
highlight mappings (blue). And pages that get faulted in flash green
(hard to capture in a screenshot).

http://selenic.com/Screenshot-kpagemap.png

And that's a live-updating image of system-wide memory usage. Bright
red are pages with a count of 1, dark red are pages with higher
counts. Next is to visualize slab/page cache/buddy/active/lru data as
well as highlight changing pages.

This isn't terribly interesting yet. It can tell you things about page
cache usage and fragmentation and readahead and so on.

But correlating across the two sources, we'll be able to show
information like "what pages in a process are actually
shared/active/lru/etc." You can take it even further by correlating
the above data with symbol info from nm, /proc/pid/clear_refs, etc.

Also, something I immediately noticed on looking at the raw data
(cat /proc/`pidof`/pagemap | hexdump -C | less):

002c8fd0  ff ff ff ff ff ff ff ff  ff ff ff ff 6d f8 03 00  |............m...|
002c8fe0  6c f8 03 00 b9 f8 03 00  6b f8 03 00 6a f8 03 00  |l.......k...j...|
002c8ff0  b8 f8 03 00 69 f8 03 00  68 f8 03 00 b7 f8 03 00  |....i...h.......|
002c9000  67 f8 03 00 66 f8 03 00  b6 f8 03 00 65 f8 03 00  |g...f.......e...|
002c9010  64 f8 03 00 b5 f8 03 00  63 f8 03 00 62 f8 03 00  |d.......c...b...|
002c9020  b4 f8 03 00 61 f8 03 00  60 f8 03 00 b3 f8 03 00  |....a...`.......|
002c9030  7f f8 03 00 7e f8 03 00  b2 f8 03 00 7d f8 03 00  |....~.......}...|
002c9040  7c f8 03 00 b1 f8 03 00  5f f8 03 00 5e f8 03 00  ||......._...^...|
002c9050  b0 f8 03 00 5d f8 03 00  5c f8 03 00 af f8 03 00  |....]...\.......|

Most of the consecutive page frames are allocated in descending order
(6d 6c 6b 6a ...). That's pessimal for physical merging of block I/O.
Given that we theoretically fixed this long-standing problem in 2.6
but it's obviously still happening, it's clear that a little more
visibility into the VM would be useful.
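
A small sketch of how a tool might quantify that, assuming 4-byte pagemap
entries as in the dump above and taking ff ff ff ff to mean "no page
present" (an assumption; the thread does not say). The struct and function
names are made up for illustration.

```c
#include <stddef.h>
#include <stdint.h>

struct adjacency {
    size_t descending;  /* entry is exactly one pfn below its predecessor */
    size_t ascending;   /* entry is exactly one pfn above its predecessor */
};

/* Count physically adjacent neighbours in a run of decoded pagemap entries. */
struct adjacency count_adjacent(const uint32_t *pfns, size_t n)
{
    struct adjacency a = { 0, 0 };
    size_t i;

    for (i = 1; i < n; i++) {
        /* skip holes: 0xffffffff is assumed to mean "not present" */
        if (pfns[i] == UINT32_MAX || pfns[i - 1] == UINT32_MAX)
            continue;

        if (pfns[i] + 1 == pfns[i - 1])
            a.descending++;
        else if (pfns[i] == pfns[i - 1] + 1)
            a.ascending++;
    }
    return a;
}
```

Fed the first eight entries of the dump, this reports two descending
neighbours and no ascending ones.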

-- 
Mathematics is the supreme nostalgia of our time.


Re: [PATCH 0/13] maps: pagemap, kpagemap, and related cleanups

2007-04-12 Thread Nick Piggin

William Lee Irwin III wrote:
> On Thu, 12 Apr 2007 16:10:50 -0700 William Lee Irwin III <[EMAIL PROTECTED]> wrote:
>
>>>This solves a real-life problem for Oracle system monitoring software
>>>(specifically EM). Among the tasks it must carry out is determining
>>>per-process memory footprint of a set of cooperating tasks (i.e. Oracle
>>>processes). RSS is inadequate for this due to page sharing; this work
>>>provides sufficient information to determine what EM needs.
>
> On Thu, Apr 12, 2007 at 04:32:35PM -0700, Andrew Morton wrote:
>
>>Not a good idea to expose raw flags in this manner - it changes at the drop
>>of a hat.  We'd need to also expose the kernel's PG_foo-to-bitnumber
>>mapping to make this viable.
>>Not a good idea to use page->_count: page_count() will be more stable.
>>Otherwise OK, I guess: the interpretation of the page refcount is unlikely
>>to change much over time.
>
> EM wants to determine page_mapcount() for the most part for the
> purposes of determining "uniquely attributable RSS" (my ca. 2004
> nomenclature) or "proportional RSS" (mpm's more recent nomenclature);
> as things now stand it will have to infer them by maintaining a table
> of pfn's and mappings thereof, but at least that can be done with it.

I don't know whether you can easily determine page_mapcount with
page_count and flags, though (count gives you an educated guess,
but mapcount is the real thing).

page_mapcount sounds very reasonable to export. It is directly
tied with the userspace concept of mapping pages. page_count doesn't
seem very useful (and if you must have it, please use page_count),
neither does page flags.

You could have a bit indicating whether the page is free or not (but
that doesn't tell you much that meminfo or zoneinfo or buddyinfo does
not). Dirty/writeback/referenced/uptodate maybe?... I'm stumped,
what's flags for?

--
SUSE Labs, Novell Inc.


Re: [PATCH 0/13] maps: pagemap, kpagemap, and related cleanups

2007-04-12 Thread Nick Piggin

Andrew Morton wrote:
> On Thu, 12 Apr 2007 16:10:50 -0700
> William Lee Irwin III <[EMAIL PROTECTED]> wrote:
>
>>+   while (count > 0) {
>>+       chunk = min_t(size_t, count, PAGE_SIZE);
>>+       i = 0;
>>+
>>+       if (pfn == -1) {
>>+           page[0] = 0;
>>+           page[1] = 0;
>>+           ((char *)page)[0] = (ntohl(1) != 1);
>
> OK.
>
>>+           ((char *)page)[1] = PAGE_SHIFT;
>
> OK.

Shouldn't we just expose page size and endianness by other means? (another
file or syscall).

>>+       for (; i < 2 * chunk / KPMSIZE; i += 2, pfn++) {
>>+           ppage = pfn_to_page(pfn);
>>+           if (!ppage) {
>>+               page[i] = 0;
>>+               page[i + 1] = 0;
>>+           } else {
>>+               page[i] = ppage->flags;
>>+               page[i + 1] = atomic_read(&ppage->_count);
>>+           }
>>+       }
>
> Not a good idea to expose raw flags in this manner - it changes at the drop
> of a hat.  We'd need to also expose the kernel's PG_foo-to-bitnumber
> mapping to make this viable.

I don't think it is viable because that makes the flags part of the
userspace ABI. I wonder what they are needed for.

--
SUSE Labs, Novell Inc.


Re: [PATCH 0/13] maps: pagemap, kpagemap, and related cleanups

2007-04-12 Thread William Lee Irwin III
On Thu, 12 Apr 2007 16:10:50 -0700 William Lee Irwin III <[EMAIL PROTECTED]> 
wrote:
>> This solves a real-life problem for Oracle system monitoring software
>> (specifically EM). Among the tasks it must carry out is determining
>> per-process memory footprint of a set of cooperating tasks (i.e. Oracle
>> processes). RSS is inadequate for this due to page sharing; this work
>> provides sufficient information to determine what EM needs.

On Thu, Apr 12, 2007 at 04:32:35PM -0700, Andrew Morton wrote:
> Not a good idea to expose raw flags in this manner - it changes at the drop
> of a hat.  We'd need to also expose the kernel's PG_foo-to-bitnumber
> mapping to make this viable.
> Not a good idea to use page->_count: page_count() will be more stable. 
> Otherwise OK, I guess: the interpretation of the page refcount is unlikely
> to change much over time.

EM wants to determine page_mapcount() for the most part for the
purposes of determining "uniquely attributable RSS" (my ca. 2004
nomenclature) or "proportional RSS" (mpm's more recent nomenclature);
as things now stand it will have to infer them by maintaining a table
of pfn's and mappings thereof, but at least that can be done with it.
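
A minimal sketch of that table-based inference. It takes "uniquely
attributable" to mean pages seen in exactly one mapping and "proportional"
to mean each page weighted by one over its number of mappings; that is one
reading of the two terms, not something spelled out here. All names are
hypothetical, and the linear scan stands in for whatever table a real
monitor would keep.

```c
#include <stddef.h>
#include <stdint.h>

struct rss_split {
    size_t unique_pages;        /* pages seen in exactly one mapping */
    double proportional_pages;  /* sum over pages of 1 / (number of mappings) */
};

/* How many times `pfn` appears in `all`, i.e. how many mappings were seen. */
size_t mappings_of(uint64_t pfn, const uint64_t *all, size_t n_all)
{
    size_t i, n = 0;

    for (i = 0; i < n_all; i++)
        if (all[i] == pfn)
            n++;
    return n;
}

/*
 * `mine` holds the pfns read from one process's pagemap; `all` holds the
 * pfns read from every process in the cooperating set, one entry per
 * mapping, including those of the process itself.
 */
struct rss_split split_rss(const uint64_t *mine, size_t n_mine,
                           const uint64_t *all, size_t n_all)
{
    struct rss_split r = { 0, 0.0 };
    size_t i;

    for (i = 0; i < n_mine; i++) {
        size_t c = mappings_of(mine[i], all, n_all);

        if (c == 0)
            continue;   /* not in the table: skip rather than guess */
        if (c == 1)
            r.unique_pages++;
        r.proportional_pages += 1.0 / (double)c;
    }
    return r;
}
```

Multiplying either result by the page size gives bytes. An exported
page_mapcount() would make the table unnecessary.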


-- wli


Re: [PATCH 0/13] maps: pagemap, kpagemap, and related cleanups

2007-04-12 Thread Andrew Morton
On Thu, 12 Apr 2007 16:10:50 -0700
William Lee Irwin III <[EMAIL PROTECTED]> wrote:

> On Tue, Apr 03, 2007 at 09:43:30PM -0500, Matt Mackall wrote:
> > This patch series introduces /proc/pid/pagemap and /proc/kpagemap,
> > which allow detailed run-time examination of process memory usage at a
> > page granularity.
> > The first several patches whip the page-walking code introduced for
> > /proc/pid/smaps and clear_refs into a more generic form, the next
> > couple make those interfaces optional, and the last two introduce the
> > new interfaces, also optional.
> 
> This solves a real-life problem for Oracle system monitoring software
> (specifically EM). Among the tasks it must carry out is determining
> per-process memory footprint of a set of cooperating tasks (i.e. Oracle
> processes). RSS is inadequate for this due to page sharing; this work
> provides sufficient information to determine what EM needs.
> 
> 

I'm still dying to see what the human-readable output from this
thing looks like.



> + * Each entry is a pair of unsigned longs representing the
> + * corresponding physical page, the first containing the page flags
> + * and the second containing the page use count.
> + *
> + * The first 4 bytes of this file form a simple header:
> + *
> + * first byte:   0 for big endian, 1 for little
> + * second byte:  page shift (eg 12 for 4096 byte pages)
> + * third byte:   entry size in bytes (currently either 4 or 8)
> + * fourth byte:  header size
>
> ...
>
> +   while (count > 0) {
> +       chunk = min_t(size_t, count, PAGE_SIZE);
> +       i = 0;
> +
> +       if (pfn == -1) {
> +           page[0] = 0;
> +           page[1] = 0;
> +           ((char *)page)[0] = (ntohl(1) != 1);

OK.

> +           ((char *)page)[1] = PAGE_SHIFT;

OK.

> +           ((char *)page)[2] = sizeof(unsigned long);

OK.

> +           ((char *)page)[3] = KPMSIZE;

OK.

> +           i = 2;
> +           pfn++;
> +       }
> +
> +       for (; i < 2 * chunk / KPMSIZE; i += 2, pfn++) {
> +           ppage = pfn_to_page(pfn);
> +           if (!ppage) {
> +               page[i] = 0;
> +               page[i + 1] = 0;
> +           } else {
> +               page[i] = ppage->flags;
> +               page[i + 1] = atomic_read(&ppage->_count);
> +           }
> +       }

Not a good idea to expose raw flags in this manner - it changes at the drop
of a hat.  We'd need to also expose the kernel's PG_foo-to-bitnumber
mapping to make this viable.

Not a good idea to use page->_count: page_count() will be more stable. 
Otherwise OK, I guess: the interpretation of the page refcount is unlikely
to change much over time.



Re: [PATCH 0/13] maps: pagemap, kpagemap, and related cleanups

2007-04-12 Thread William Lee Irwin III
On Tue, Apr 03, 2007 at 09:43:30PM -0500, Matt Mackall wrote:
> This patch series introduces /proc/pid/pagemap and /proc/kpagemap,
> which allow detailed run-time examination of process memory usage at a
> page granularity.
> The first several patches whip the page-walking code introduced for
> /proc/pid/smaps and clear_refs into a more generic form, the next
> couple make those interfaces optional, and the last two introduce the
> new interfaces, also optional.

This solves a real-life problem for Oracle system monitoring software
(specifically EM). Among the tasks it must carry out is determining
per-process memory footprint of a set of cooperating tasks (i.e. Oracle
processes). RSS is inadequate for this due to page sharing; this work
provides sufficient information to determine what EM needs.


-- wli


Re: [PATCH 0/13] maps: pagemap, kpagemap, and related cleanups

2007-04-12 Thread Matt Mackall
On Thu, Apr 12, 2007 at 04:32:35PM -0700, Andrew Morton wrote:
 On Thu, 12 Apr 2007 16:10:50 -0700
 William Lee Irwin III [EMAIL PROTECTED] wrote:
 
  On Tue, Apr 03, 2007 at 09:43:30PM -0500, Matt Mackall wrote:
   This patch series introduces /proc/pid/pagemap and /proc/kpagemap,
   which allow detailed run-time examination of process memory usage at a
   page granularity.
   The first several patches whip the page-walking code introduced for
   /proc/pid/smaps and clear_refs into a more generic form, the next
   couple make those interfaces optional, and the last two introduce the
   new interfaces, also optional.
  
  This solves a real-life problem for Oracle system monitoring software
  (specifically EM). Among the tasks it must carry out is determining
  per-process memory footprint of a set of cooperating tasks (i.e. Oracle
  processes). RSS is inadequate for this due to page sharing; this work
  provides sufficient information to determine what EM needs.
 
 I'm still dying to see what the human-readable output from this
 thing looks like.

Still a work-in-progress. It's a monstrous amount of data and it
basically requires a GUI to really get a handle on. Here's a couple
apps I've been tinkering with (aka My First GTK Apps):

http://selenic.com/Screenshot-pagemap.png

That's a snapshot of a live-updating image of memory usage for a
running process (Galeon). Each pixel is a page. Each 32x32 block is
4MB. Mappings are dark red. Pages that are actually faulted in are
bright red. You can poke around in the memory map with the mouse and
highlight mappings (blue). And pages that get faulted in flash green
(hard to capture in a screenshot).
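
The block arithmetic above (32x32 pixels, one page per pixel, 4MB per block with 4KB pages) can be sketched as follows. The layout function is only one plausible arrangement, assumed for illustration; the thread does not say how the tool actually orders blocks or pixels.

```c
#include <stddef.h>

enum { BLOCK_SIDE = 32, PAGES_PER_BLOCK = BLOCK_SIDE * BLOCK_SIDE };

struct pixel { size_t x, y; };

/* Bytes of memory covered by one 32x32 block for a given page size. */
size_t block_bytes(size_t page_size)
{
    return (size_t)PAGES_PER_BLOCK * page_size;
}

/* Assumed layout: blocks are placed left to right, blocks_per_row to a
 * row, and pages fill each block row by row.
 * blocks_per_row must be nonzero. */
struct pixel page_to_pixel(size_t page_index, size_t blocks_per_row)
{
    size_t block = page_index / PAGES_PER_BLOCK;
    size_t within = page_index % PAGES_PER_BLOCK;
    struct pixel p;

    p.x = (block % blocks_per_row) * BLOCK_SIDE + within % BLOCK_SIDE;
    p.y = (block / blocks_per_row) * BLOCK_SIDE + within / BLOCK_SIDE;
    return p;
}
```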

http://selenic.com/Screenshot-kpagemap.png

And that's a live-updating image of system-wide memory usage. Bright
red are pages with a count of 1, dark red are pages with higher
counts. Next is to visualize slab/page cache/buddy/active/lru data as
well as highlight changing pages.

This isn't terribly interesting yet. It can tell you things about page
cache usage and fragmentation and readahead and so on.

But correlating across the two sources, we'll be able to show
information like what pages in a process are actually
shared/active/lru/etc. You can take it even further by correlating
the above data with symbol info from nm, /proc/pid/clear_refs, etc.

Also, something I immediately noticed on looking at the raw data
(cat /proc/`pidof`/pagemap | hexdump -C | less):

002c8fd0  ff ff ff ff ff ff ff ff  ff ff ff ff 6d f8 03 00  |............m...|
002c8fe0  6c f8 03 00 b9 f8 03 00  6b f8 03 00 6a f8 03 00  |l.......k...j...|
002c8ff0  b8 f8 03 00 69 f8 03 00  68 f8 03 00 b7 f8 03 00  |....i...h.......|
002c9000  67 f8 03 00 66 f8 03 00  b6 f8 03 00 65 f8 03 00  |g...f.......e...|
002c9010  64 f8 03 00 b5 f8 03 00  63 f8 03 00 62 f8 03 00  |d.......c...b...|
002c9020  b4 f8 03 00 61 f8 03 00  60 f8 03 00 b3 f8 03 00  |....a...`.......|
002c9030  7f f8 03 00 7e f8 03 00  b2 f8 03 00 7d f8 03 00  |....~.......}...|
002c9040  7c f8 03 00 b1 f8 03 00  5f f8 03 00 5e f8 03 00  ||......._...^...|
002c9050  b0 f8 03 00 5d f8 03 00  5c f8 03 00 af f8 03 00  |....]...\.......|

Most of the consecutive page frames are allocated in descending order
(6d 6c 6b 6a ...). That's pessimal for physical merging of block I/O.
Given that we theoretically fixed this long-standing problem in 2.6
but it's obviously still happening, it's clear that a little more
visibility into the VM would be useful.
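
A check like the one done by eye above could be automated roughly as follows. This is a sketch under assumptions read off the dump: entries are 4-byte little-endian frame numbers already decoded into an array, and 0xffffffff marks a page that is not present. Neither is stated as a format guarantee in the thread.

```c
#include <stddef.h>
#include <stdint.h>

#define PFN_ABSENT 0xffffffffu  /* assumed "not present" marker */

struct adjacency { size_t ascending, descending; };

/* Count neighbouring entries whose frame numbers differ by exactly one,
 * separately for ascending and descending order. */
struct adjacency count_adjacent(const uint32_t *pfn, size_t n)
{
    struct adjacency a = {0, 0};

    for (size_t i = 1; i < n; i++) {
        if (pfn[i] == PFN_ABSENT || pfn[i - 1] == PFN_ABSENT)
            continue;
        if (pfn[i] == pfn[i - 1] + 1)
            a.ascending++;
        else if (pfn[i] + 1 == pfn[i - 1])
            a.descending++;
    }
    return a;
}
```

A descending count well above the ascending count would point at the allocation pattern described above.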

-- 
Mathematics is the supreme nostalgia of our time.


Re: [PATCH 0/13] maps: pagemap, kpagemap, and related cleanups

2007-04-12 Thread Matt Mackall
On Fri, Apr 13, 2007 at 10:15:24AM +1000, Nick Piggin wrote:
 Andrew Morton wrote:
 On Thu, 12 Apr 2007 16:10:50 -0700
 William Lee Irwin III [EMAIL PROTECTED] wrote:
 
 +   while (count > 0) {
 +   chunk = min_t(size_t, count, PAGE_SIZE);
 +   i = 0;
 +
 +   if (pfn == -1) {
 +   page[0] = 0;
 +   page[1] = 0;
 +   ((char *)page)[0] = (ntohl(1) != 1);
 
 
 OK.
 
 
 +   ((char *)page)[1] = PAGE_SHIFT;
 
 
 OK.
 
 Shouldn't we just expose page size and endianness by other means? (another
 file or syscall).

If I send you this file dumped from a random machine, you won't know
what to make of it.

I'm planning to write a trivial server to sit on, say, my embedded
target and spew this over the wire to a client. 
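
A client receiving such a dump could decode the two self-describing bytes along these lines. The sketch follows the quoted patch fragment (byte 0 holds the result of ntohl(1) != 1, byte 1 holds PAGE_SHIFT); the struct and function names are made up for illustration, and the layout of the rest of the header record is not assumed.

```c
#include <limits.h>
#include <stddef.h>

struct kpm_header {
    int little_endian;      /* nonzero if the producer was little-endian */
    unsigned page_shift;    /* log2 of the producer's page size */
};

/* Returns 0 on success, -1 if the buffer is too short or the shift is
 * too large to be a usable page shift on this machine. */
int kpm_parse_header(const unsigned char *buf, size_t len,
                     struct kpm_header *out)
{
    if (len < 2)
        return -1;
    if (buf[1] >= sizeof(unsigned long) * CHAR_BIT)
        return -1;
    /* ntohl(1) != 1 is true only where host order differs from network
     * (big-endian) order, i.e. on little-endian machines. */
    out->little_endian = buf[0] != 0;
    out->page_shift = buf[1];
    return 0;
}

/* Page size in bytes implied by the header. */
unsigned long kpm_page_size(const struct kpm_header *h)
{
    return 1UL << h->page_shift;
}
```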

 Not a good idea to expose raw flags in this manner - it changes at the drop
 of a hat.  We'd need to also expose the kernel's PG_foo-to-bitnumber
 mapping to make this viable.
 
 I don't think it is viable because that makes the flags part of the
 userspace ABI. I wonder what they are needed for.

Basically: to show what the hell's going on in the VM.

-- 
Mathematics is the supreme nostalgia of our time.


Re: [PATCH 0/13] maps: pagemap, kpagemap, and related cleanups

2007-04-12 Thread Andrew Morton
On Fri, 13 Apr 2007 10:15:24 +1000 Nick Piggin [EMAIL PROTECTED] wrote:

 +   ((char *)page)[1] = PAGE_SHIFT;
  
  
  OK.
 
 Shouldn't we just expose page size and endianness by other means? (another
 file or syscall).

I don't think so - this file exposes fairly deep kernel internals and
that's unavoidable, really - it's *supposed* to do that.  It is explicitly
designed for monitoring kernel behaviour.

So it needs special handling by userspace.  Keeping the number of files
which need such special handling to a minimum will keep the number of
applications which are exposed to kernel changes to a minimum.

 +   for (; i < 2 * chunk / KPMSIZE; i += 2, pfn++) {
 +   ppage = pfn_to_page(pfn);
 +   if (!ppage) {
 +   page[i] = 0;
 +   page[i + 1] = 0;
 +   } else {
 +   page[i] = ppage->flags;
 +   page[i + 1] = atomic_read(&ppage->_count);
 +   }
 +   }
  
  
  Not a good idea to expose raw flags in this manner - it changes at the drop
  of a hat.  We'd need to also expose the kernel's PG_foo-to-bitnumber
  mapping to make this viable.
 
 I don't think it is viable because that makes the flags part of the
 userspace ABI.

It *will* be viable.  If the application wants to know if a page is dirty,
it looks up PG_dirty in /proc/pg_foo-to-bitnumber and uses PG_dirty's
numerical offset when inspecting fields in /proc/kpagemap.  If correctly
designed, such a monitoring application will be able to report upon page
flags which we haven't even thought up yet.
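
The runtime lookup described above might look roughly like this in a monitoring tool. It is only a sketch: /proc/pg_foo-to-bitnumber is a placeholder name from this discussion, and the "NAME BITNUMBER" one-per-line text format assumed here is hypothetical, as are the function names.

```c
#include <limits.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Look up `name` in a text table of "NAME BITNUMBER" lines (assumed
 * format). Returns the bit number, or -1 if the name is not listed. */
int flag_bit_lookup(const char *table, const char *name)
{
    const char *line = table;

    while (*line) {
        char found[64];
        int bit;

        if (sscanf(line, "%63s %d", found, &bit) == 2 &&
            strcmp(found, name) == 0)
            return bit;
        line = strchr(line, '\n');
        if (!line)
            break;
        line++;
    }
    return -1;
}

/* Test one bit of a raw flags word; an unknown or out-of-range bit
 * number is reported as "not set". */
int flag_is_set(unsigned long flags, int bit)
{
    if (bit < 0 || (size_t)bit >= sizeof(flags) * CHAR_BIT)
        return 0;
    return (int)((flags >> bit) & 1UL);
}
```

Because the tool never hard-codes a bit number, a flag that moves between kernel versions, or one that did not exist when the tool was written, only changes the table it reads.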

 I wonder what they are needed for.

Poking deeply into the kernel to provide information about kernel state. 

There are real-world needs for this, and the people who develop tools to
process this information will have decent kernel understanding and will
know that the file's contents may alter across kernel versions.  It sure
beats poking around in /dev/kmem.

I doubt if there's a sensible way in which we can prettify this interface
without losing information.  But we should aim to make it as robust as
possible against future kernel changes, of course.

And we should satisfy ourselves that all the required information has been
made available.  The fact that it will satisfy the Oracle requirement is
encouraging.

Matt, these changes make the new field in /proc/pid/smaps redundant, don't
they?



Re: [PATCH 0/13] maps: pagemap, kpagemap, and related cleanups

2007-04-12 Thread Nick Piggin

Matt Mackall wrote:

On Fri, Apr 13, 2007 at 10:15:24AM +1000, Nick Piggin wrote:


Andrew Morton wrote:


On Thu, 12 Apr 2007 16:10:50 -0700
William Lee Irwin III [EMAIL PROTECTED] wrote:



+   while (count > 0) {
+   chunk = min_t(size_t, count, PAGE_SIZE);
+   i = 0;
+
+   if (pfn == -1) {
+   page[0] = 0;
+   page[1] = 0;
+   ((char *)page)[0] = (ntohl(1) != 1);



OK.




+   ((char *)page)[1] = PAGE_SHIFT;



OK.


Shouldn't we just expose page size and endianness by other means? (another
file or syscall).



If I send you this file dumped from a random machine, you won't know
what to make of it.


That's a good reason ;)


I'm planning to write a trivial server to sit on, say, my embedded
target and spew this over the wire to a client. 




Not a good idea to expose raw flags in this manner - it changes at the drop
of a hat.  We'd need to also expose the kernel's PG_foo-to-bitnumber
mapping to make this viable.


I don't think it is viable because that makes the flags part of the
userspace ABI. I wonder what they are needed for.



Basically: to show what the hell's going on in the VM.


kprobes / systemtap isn't good enough?

--
SUSE Labs, Novell Inc.


Re: [PATCH 0/13] maps: pagemap, kpagemap, and related cleanups

2007-04-12 Thread Nick Piggin

Andrew Morton wrote:

On Fri, 13 Apr 2007 10:15:24 +1000 Nick Piggin [EMAIL PROTECTED] wrote:



+   for (; i < 2 * chunk / KPMSIZE; i += 2, pfn++) {
+   ppage = pfn_to_page(pfn);
+   if (!ppage) {
+   page[i] = 0;
+   page[i + 1] = 0;
+   } else {
+   page[i] = ppage->flags;
+   page[i + 1] = atomic_read(&ppage->_count);
+   }
+   }



Not a good idea to expose raw flags in this manner - it changes at the drop
of a hat.  We'd need to also expose the kernel's PG_foo-to-bitnumber
mapping to make this viable.


I don't think it is viable because that makes the flags part of the
userspace ABI.



It *will* be viable.  If the application wants to know if a page is dirty,
it looks up PG_dirty in /proc/pg_foo-to-bitnumber and uses PG_dirty's
numerical offset when inspecting fields in /proc/kpagemap.  If correctly
designed, such a monitoring application will be able to report upon page
flags which we haven't even thought up yet.


Ooh, you wanted a _runtime_ mapping of flags, yeah then I guess that works.
Still seems like a basically hit and miss affair to just use flags. What if
you want to know the process mapping a page? With systemtap or something you
could walk the rmap structures. What if you want to look at pages along the
LRU list rather than per-pfn? What about connecting pages to inodes?

I thought this type of deep poking was the whole reason the probes thingies
were merged. I'm saddened that they're no good for this. I thought it would
be an ideal usage :(



I wonder what they are needed for.



Poking deeply into the kernel to provide information about kernel state. 


There are real-world needs for this, and the people who develop tools to
process this information will have decent kernel understanding and will
know that the file's contents may alter across kernel versions.  It sure
beats poking around in /dev/kmem.

I doubt if there's a sensible way in which we can prettify this interface
without losing information.  But we should aim to make it as robust as
possible against future kernel changes, of course.

And we should satisfy ourselves that all the required information has been
made available.  The fact that it will satisfy the Oracle requirement is
encouraging.


Yeah it is close, they need page_mapcount I think. But I was going to say
that satisfying an Oracle requirement is a good reason _not_ to merge it ;)
(I joke!)

--
SUSE Labs, Novell Inc.


Re: [PATCH 0/13] maps: pagemap, kpagemap, and related cleanups

2007-04-12 Thread Andrew Morton
On Fri, 13 Apr 2007 11:14:20 +1000 Nick Piggin [EMAIL PROTECTED] wrote:

 Andrew Morton wrote:
  On Fri, 13 Apr 2007 10:15:24 +1000 Nick Piggin [EMAIL PROTECTED] wrote:
 
 +   for (; i < 2 * chunk / KPMSIZE; i += 2, pfn++) {
 +   ppage = pfn_to_page(pfn);
 +   if (!ppage) {
 +   page[i] = 0;
 +   page[i + 1] = 0;
 +   } else {
 +   page[i] = ppage->flags;
 +   page[i + 1] = atomic_read(&ppage->_count);
 +   }
 +   }
 
 
 Not a good idea to expose raw flags in this manner - it changes at the drop
 of a hat.  We'd need to also expose the kernel's PG_foo-to-bitnumber
 mapping to make this viable.
 
 I don't think it is viable because that makes the flags part of the
 userspace ABI.
  
  
  It *will* be viable.  If the application wants to know if a page is dirty,
  it looks up PG_dirty in /proc/pg_foo-to-bitnumber and uses PG_dirty's
  numerical offset when inspecting fields in /proc/kpagemap.  If correctly
  designed, such a monitoring application will be able to report upon page
  flags which we haven't even thought up yet.
 
 Ooh, you wanted a _runtime_ mapping of flags, yeah then I guess that works.
 Still seems like a basically hit and miss affair to just use flags. What if
 you want to know the process mapping a page? With systemtap or something you
 could walk the rmap structures. What if you want to look at pages along the
 LRU list rather than per-pfn? What about connecting pages to inodes?

Well hang on.  This isn't a tool for understanding kernel behaviour.  It's
a tool for understanding application behaviour.

So one doesn't ask "who is mapping that page" - that's a kernel developer
thing.

Instead, one says "what pages are being used by my application", then, for
each of those pages "what is that page's state".  So the first step is to
collect all the pfns from /proc/$(pidof my-application)/pagemap and then to
use those pfns to look the individual pages up in /proc/kpagemap.
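
The second step, indexing /proc/kpagemap by pfn, could be sketched as below. The record layout is inferred from the patch fragment quoted earlier in the thread (two unsigned long words per page, flags then count, preceded by one header record for pfn == -1) and may not match the final interface. The sketch works on an in-memory image and assumes the reader has the same word size and byte order as the producer; all names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

struct kpm_record { unsigned long flags, count; };

/* Byte offset of the record for `pfn`, assuming one header record
 * at the start of the file. */
size_t kpm_offset(unsigned long pfn)
{
    return ((size_t)pfn + 1) * sizeof(struct kpm_record);
}

/* Copy the record for `pfn` out of an in-memory image of the file.
 * Returns 0 on success, -1 if the pfn lies outside the image. */
int kpm_lookup(const unsigned char *image, size_t len, unsigned long pfn,
               struct kpm_record *out)
{
    size_t off;

    if (pfn >= SIZE_MAX / sizeof(*out) - 1)
        return -1;              /* offset would overflow */
    off = kpm_offset(pfn);
    if (off > len || len - off < sizeof(*out))
        return -1;
    memcpy(out, image + off, sizeof(*out));
    return 0;
}
```

On a live system the same offset would be passed to pread() on the file rather than used to index a buffer.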

If you really want to know who is using page 123435 then you'd need to
search /proc/*/pagemap.  There are possibly legitimate reasons why an
application developer would want to at least partially perform such an
operation ("who am I sharing with"), but I doubt if it's the common case.
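
A partial form of that search, comparing one process against one other, might be sketched like this. It assumes the pfns of both processes have already been extracted from their pagemap files into arrays; the function name and the 64-bit element type are choices made for the example.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

static int cmp_pfn(const void *a, const void *b)
{
    uint64_t x = *(const uint64_t *)a, y = *(const uint64_t *)b;

    return (x > y) - (x < y);
}

/* Count frames present in both lists. Both arrays are sorted in place;
 * a frame that appears several times in one list is counted once. */
size_t count_shared_pfns(uint64_t *mine, size_t n_mine,
                         uint64_t *theirs, size_t n_theirs)
{
    size_t i = 0, j = 0, shared = 0;

    qsort(mine, n_mine, sizeof(*mine), cmp_pfn);
    qsort(theirs, n_theirs, sizeof(*theirs), cmp_pfn);
    while (i < n_mine && j < n_theirs) {
        if (mine[i] < theirs[j]) {
            i++;
        } else if (mine[i] > theirs[j]) {
            j++;
        } else {
            uint64_t v = mine[i];

            shared++;
            while (i < n_mine && mine[i] == v)
                i++;
            while (j < n_theirs && theirs[j] == v)
                j++;
        }
    }
    return shared;
}
```

Repeating this for every other process is the full search, which is why it is expensive compared with the per-application lookup.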

 
 But I was going to say
 that satisfying an Oracle requirement is a good reason _not_ to merge it ;)


hm, yes, there's plenty of precedent for that.

 (I joke!)

I akpm!


Re: [PATCH 0/13] maps: pagemap, kpagemap, and related cleanups

2007-04-12 Thread Nick Piggin

Andrew Morton wrote:

On Fri, 13 Apr 2007 11:14:20 +1000 Nick Piggin [EMAIL PROTECTED] wrote:



Andrew Morton wrote:



It *will* be viable.  If the application wants to know if a page is dirty,
it looks up PG_dirty in /proc/pg_foo-to-bitnumber and uses PG_dirty's
numerical offset when inspecting fields in /proc/kpagemap.  If correctly
designed, such a monitoring application will be able to report upon page
flags which we haven't even thought up yet.


Ooh, you wanted a _runtime_ mapping of flags, yeah then I guess that works.
Still seems like a basically hit and miss affair to just use flags. What if
you want to know the process mapping a page? With systemtap or something you
could walk the rmap structures. What if you want to look at pages along the
LRU list rather than per-pfn? What about connecting pages to inodes?



Well hang on.  This isn't a tool for understanding kernel behaviour.  It's
a tool for understanding application behaviour.

So one doesn't ask who is mapping that page - that's a kernel developer
thing.

Instead, one says what pages are being used by my application, then, for


That includes unmapped pagecache being used by my application, doesn't it?
Maybe that's too hard to do via /proc so we forget about it...



each of those pages what is that page's state.  So the first step is to
collect all the pfns from /proc/$(pidof my-application)/pagemap and then to
use those pfns to look the individual pages up in /proc/kpagemap.


OK I realise you could do it that way, but systemtap can definitely be
used as a tool for understanding application behaviour in the context of
the kernel, I think? The purpose for it is so that various little bits
of deep kernel internals do not have to be exposed on a case by case basis.

If kprobes is simply crappy and doesn't work properly for this, then I
could accept that. I'm not someone trying to get this info. So why can't
it be used? (not just for kpagemap, but for clear_refs and all that gunk
too).

 If you really want to know who is using page 123435 then you'd need to
 search /proc/*/pagemap.  There are possibly legitimate reasons why an
 application developer would want to at least partially perform such an
 operation (who am I sharing with), but I doubt if it's the common case.

Maybe. How about LRU? Reclaim performance is bad, and you want to work out
which pages keep going off the end of it, or which pages keep getting
written out via it, or whose pages are on the active list, forcing mine
out.

--
SUSE Labs, Novell Inc.


Re: [PATCH 0/13] maps: pagemap, kpagemap, and related cleanups

2007-04-12 Thread Matt Mackall
On Fri, Apr 13, 2007 at 11:01:41AM +1000, Nick Piggin wrote:
 Basically: to show what the hell's going on in the VM.
 
 kprobes / systemtap isn't good enough?

It's not really a good match to the kprobes model. I'm not interested
in events, per se. I don't want to need to know about every single
alloc/free of N different varieties integrated from boot onward to
build up an image of the state of the system. Instead, I want to take
snapshots of the state of the VM.

The main goal here is to be able to answer the question "where's my
memory going?". Currently you can't really give a good answer to that
question from userspace because of shared mappings, etc.

There are lots of secondary questions that follow on very quickly from
that, like "what parts of my shared mappings are or aren't shared, and
why?", "what's actually in my application's working set?" and "how much
of this crap can I ditch?".

-- 
Mathematics is the supreme nostalgia of our time.


Re: [PATCH 0/13] maps: pagemap, kpagemap, and related cleanups

2007-04-12 Thread Andrew Morton
On Fri, 13 Apr 2007 11:42:29 +1000 Nick Piggin [EMAIL PROTECTED] wrote:

 Andrew Morton wrote:
  On Fri, 13 Apr 2007 11:14:20 +1000 Nick Piggin [EMAIL PROTECTED] wrote:
  
  
 Andrew Morton wrote:
 
 It *will* be viable.  If the application wants to know if a page is dirty,
 it looks up PG_dirty in /proc/pg_foo-to-bitnumber and uses PG_dirty's
 numerical offset when inspecting fields in /proc/kpagemap.  If correctly
 designed, such a monitoring application will be able to report upon page
 flags which we haven't even thought up yet.
 
 Ooh, you wanted a _runtime_ mapping of flags, yeah then I guess that works.
 Still seems like a basically hit and miss affair to just use flags. What if
 you want to know the process mapping a page? With systemtap or something you
 could walk the rmap structures. What if you want to look at pages along the
 LRU list rather than per-pfn? What about connecting pages to inodes?
  
  
  Well hang on.  This isn't a tool for understanding kernel behaviour.  It's
  a tool for understanding application behaviour.
  
  So one doesn't ask who is mapping that page - that's a kernel developer
  thing.
  
  Instead, one says what pages are being used by my application, then, for
 
 That includes unmapped pagecache being used by my application, doesn't it?
 Maybe that's too hard to do via /proc so we forget about it...

Yes, harder.  I'm hoping that sampling of /proc/pid/io can be used to
determine pagecache use sufficiently accurately.  I know of one large
hosting company who are using it ("BTW, we are making great use of
taskstats!!  Its GREAT!")

 
  each of those pages what is that page's state.  So the first step is to
  collect all the pfns from /proc/$(pidof my-application)/pagemap and then to
  use those pfns to look the individual pages up in /proc/kpagemap.
 
 OK I realise you could do it that way, but systemtap can definitely be
 used as a tool for understanding application behaviour in the context of
 the kernel, I think? The purpose for it is so that various little bits
 of deep kernel internals do not have to be exposed on a case by case basis.
 
 If kprobes is simply crappy and doesn't work properly for this, then I
 could accept that. I'm not someone trying to get this info. So why can't
 it be used? (not just for kpagemap, but for clear_refs and all that gunk
 too).
 
   If you really want to know who is using page 123435 then you'd need to
   search /proc/*/pagemap.  There are possibly legitimate reasons why an
   application developer would want to at least partially perform such an
   operation (who am I sharing with), but I doubt if it's the common case.
 
 Maybe. How about LRU? Reclaim performance is bad, and you want to work out
 which pages keep going off the end of it, or which pages keep getting
 written out via it, or whose pages are on the active list, forcing mine
 out.

I guess we have static analysis versus dynamic.  The interfaces which Matt
is proposing are suited to answering the question "what is my memory being
used for" (static).  They're unlikely to be useful for answering the question
"what's happening in the VM" (dynamic).  Systemtap is probably better for the
dynamic analysis.

I guess one could generate an answer to the static question with systemtap,
by accumulating running counts across the application lifetime and then
snapshotting them.  Sounds hard though.



Re: [PATCH 0/13] maps: pagemap, kpagemap, and related cleanups

2007-04-12 Thread Matt Mackall
On Fri, Apr 13, 2007 at 11:42:29AM +1000, Nick Piggin wrote:
 Instead, one says what pages are being used by my application, then, for
 
 That includes unmapped pagecache being used by my application, doesn't it?
 Maybe that's too hard to do via /proc so we forget about it...

It'd be really nice to have a window into the pagecache too. But I for
one couldn't come up with a sensible scheme for it.

 each of those pages what is that page's state.  So the first step is to
 collect all the pfns from /proc/$(pidof my-application)/pagemap and then to
 use those pfns to look the individual pages up in /proc/kpagemap.
 
 OK I realise you could do it that way, but systemtap can definitely be
 used as a tool for understanding application behaviour in the context of
 the kernel, I think? The purpose for it is so that various little bits
 of deep kernel internals do not have to be exposed on a case by case basis.
 
 If kprobes is simply crappy and doesn't work properly for this, then I
 could accept that. I'm not someone trying to get this info. So why can't
 it be used? (not just for kpagemap, but for clear_refs and all that gunk
 too).

kprobes is good for looking at events, but bad for looking at state.
Especially metric shitloads of state.

  If you really want to know who is using page 123435 then you'd need to
  search /proc/*/pagemap.  There are possibly legitimate reasons why an
  application developer would want to at least partially perform such an
  operation (who am I sharing with), but I doubt if it's the common case.
 
 Maybe. How about LRU? Reclaim performance is bad, and you want to work out
 which pages keep going off the end of it, or which pages keep getting
 written out via it, or whose pages are on the active list, forcing mine
 out.

Those are actually probably a good match for systemtap as they're all events.

-- 
Mathematics is the supreme nostalgia of our time.


Re: [PATCH 0/13] maps: pagemap, kpagemap, and related cleanups

2007-04-12 Thread Nick Piggin

Matt Mackall wrote:

On Fri, Apr 13, 2007 at 11:01:41AM +1000, Nick Piggin wrote:


Basically: to show what the hell's going on in the VM.


kprobes / systemtap isn't good enough?



It's not really a good match to the kprobes model. I'm not interested
in events, per se. I don't want to need to know about every single
alloc/free of N different varieties integrated from boot onward to
build up an image of the state of the system. Instead, I want to take
snapshots of the state of the VM.


Systemtap can't output a large set of values?

Why can't you attach a kprobe to a dummy syscall, and from there
iterate over pgdat/zone/memmap and output what you want?

Actually I'm surprised that kind of data querying facility isn't
already in there (I haven't used it seriously though).



The main goal here is to be able to answer the question "where's my
memory going?". Currently you can't really give a good answer to that
question from userspace because of shared mappings, etc.

There are lots of secondary questions that follow on very quickly from
that, like "what parts of my shared mappings are or aren't shared, and
why?", "what's actually in my application's working set?" and "how much
of this crap can I ditch?".


I understand roughly what you want, and that you can't easily get
it from /proc currently. My question at this point is just why can
we not use systemtap.

--
SUSE Labs, Novell Inc.


Re: [PATCH 0/13] maps: pagemap, kpagemap, and related cleanups

2007-04-12 Thread Matt Mackall
On Thu, Apr 12, 2007 at 06:57:23PM -0700, Andrew Morton wrote:
 I guess one could generate an answer to the static question with systemtap,
 by accumulating running counts across the application lifetime and then
 snapshotting them.  Sounds hard though.

You'd have to do it from boot onward to get a complete system image.
One way to look at it is that systemtap can give you the derivative of
the information, and you have to integrate it.
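
The derivative/integral point can be illustrated with a toy model. In this sketch an event stream of page allocations and frees is replayed on top of a starting state; the event type and function names are invented for the example and do not correspond to any systemtap or kprobes interface.

```c
#include <stddef.h>

enum vm_event_kind { EV_ALLOC, EV_FREE };

struct vm_event {
    enum vm_event_kind kind;
    unsigned long pfn;
};

/* Replay events on top of a starting state, where in_use[pfn] is 0 or 1.
 * Events for frames outside the array are ignored. */
void integrate_events(unsigned char *in_use, size_t n_pages,
                      const struct vm_event *ev, size_t n_events)
{
    for (size_t i = 0; i < n_events; i++) {
        if (ev[i].pfn >= n_pages)
            continue;
        in_use[ev[i].pfn] = (ev[i].kind == EV_ALLOC);
    }
}

/* Number of frames currently marked in use. */
size_t pages_in_use(const unsigned char *in_use, size_t n_pages)
{
    size_t n = 0;

    for (size_t i = 0; i < n_pages; i++)
        n += in_use[i] != 0;
    return n;
}
```

An observer that starts from the true initial state ends up with the right picture; one that starts late, from an all-zero array, never learns about pages allocated before it began watching, which is the case for reading a snapshot instead.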

-- 
Mathematics is the supreme nostalgia of our time.


Re: [PATCH 0/13] maps: pagemap, kpagemap, and related cleanups

2007-04-12 Thread Nick Piggin

Andrew Morton wrote:

On Fri, 13 Apr 2007 11:42:29 +1000 Nick Piggin [EMAIL PROTECTED] wrote:



Maybe. How about LRU? Reclaim performance is bad, and you want to work out
which pages keep going off the end of it, or which pages keep getting
written out via it, or whose pages are on the active list, forcing mine
out.



I guess we have static analysis versus dynamic.  The interfaces which Matt
is proposing are suited to answering the question "what is my memory being
used for" (static).  They're unlikely to be useful for answering the question
"what's happening in the VM" (dynamic).  Systemtap is probably better for the
dynamic analysis.


what is my memory being used for *now* ;)



I guess one could generate an answer to the static question with systemtap,
by accumulating running counts across the application lifetime and then
snapshotting them.  Sounds hard though.


Can't you just traverse arbitrary kernel data structures at a given point
in time, exactly like the /proc/ call is doing?

--
SUSE Labs, Novell Inc.


Re: [PATCH 0/13] maps: pagemap, kpagemap, and related cleanups

2007-04-12 Thread Nick Piggin

Matt Mackall wrote:

On Fri, Apr 13, 2007 at 11:42:29AM +1000, Nick Piggin wrote:



If kprobes is simply crappy and doesn't work properly for this, then I
could accept that. I'm not someone trying to get this info. So why can't
it be used? (not just for kpagemap, but for clear_refs and all that gunk
too).



kprobes is good for looking at events, but bad for looking at state.
Especially metric shitloads of state.


Why? Why is a kprobes trap significantly more expensive than a read
syscall?


Maybe. How about LRU? Reclaim performance is bad, and you want to work out
which pages keep going off the end of it, or which pages keep getting
written out via it, or whose pages are on the active list, forcing mine
out.



Those are actually probably a good match for systemtap as they're all events.


Traverse the LRU? Which files do they belong to? What process maps them?

--
SUSE Labs, Novell Inc.


Re: [PATCH 0/13] maps: pagemap, kpagemap, and related cleanups

2007-04-12 Thread Nick Piggin

Matt Mackall wrote:

On Thu, Apr 12, 2007 at 06:57:23PM -0700, Andrew Morton wrote:


I guess one could generate an answer to the static question with systemtap,
by accumulating running counts across the application lifetime and then
snapshotting them.  Sounds hard though.



You'd have to do it from boot onward to get a complete system image.
One way to look at it is that systemtap can give you the derivative of
the information, and you have to integrate it.


So everyone keeps saying.

Would you tell me why you can't just traverse the data structures
in the same way as your proc handler? From the systemtap example
scripts it seems like you can traverse arbitrary kernel data
structures.

--
SUSE Labs, Novell Inc.


Re: [PATCH 0/13] maps: pagemap, kpagemap, and related cleanups

2007-04-12 Thread Andrew Morton
On Fri, 13 Apr 2007 12:18:56 +1000 Nick Piggin [EMAIL PROTECTED] wrote:

  I guess one could generate an answer to the static question with systemtap,
  by accumulating running counts across the application lifetime and then
  snapshotting them.  Sounds hard though.
 
 Can't you just traverse arbitrary kernel data structures at a given point
 in time, exactly like the /proc/ call is doing?

Do a full pagetable walk, with all the associated locking from within
a systemtap script?  I'd be surprised.  Maybe if it's mostly hand-coded
in C, perhaps.  Then you just end up with the same thing, don't you?


Re: [PATCH 0/13] maps: pagemap, kpagemap, and related cleanups

2007-04-12 Thread Matt Mackall
On Fri, Apr 13, 2007 at 12:21:25PM +1000, Nick Piggin wrote:
 Matt Mackall wrote:
 On Fri, Apr 13, 2007 at 11:42:29AM +1000, Nick Piggin wrote:
 
 If kprobes is simply crappy and doesn't work properly for this, then I
 could accept that. I'm not someone trying to get this info. So why can't
 it be used? (not just for kpagemap, but for clear_refs and all that gunk
 too).
 
 
 kprobes is good for looking at events, but bad for looking at state.
 Especially metric shitloads of state.
 
 Why? Why is a kprobes trap significantly more expensive than a read
 syscall?

I guess I'm not clear on what you're proposing. From my understanding
of kprobes (admittedly not an expert), this is hard to do and not a
very good match.
 
 Maybe. How about LRU? Reclaim performance is bad, and you want to work out
 which pages keep going off the end of it, or which pages keep getting
 written out via it, or whose pages are on the active list, forcing mine
 out.
 
 
 Those are actually probably a good match for systemtap as they're all events.
 
 Traverse the LRU? Which files do they belong to? What process maps them?

-ENOPARSE.

-- 
Mathematics is the supreme nostalgia of our time.

