Re: [PATCH 0/13] maps: pagemap, kpagemap, and related cleanups
Hi -

On Mon, Apr 16, 2007 at 11:36:05PM +0200, Andi Kleen wrote:
> Christoph Hellwig <[EMAIL PROTECTED]> writes:
> > and [systemtap] does a lot of really wrong things in its
> > runtime). [...]

(Thanks, Christoph, for at least a few specifics. Some of them have already been dealt with in the recent past.)

> I must agree with that. Perhaps it would be good if its runtime code
> was posted to l-k at some point and reviewed in the standard way
> even when it isn't merged.

I'll let the runtime's maintainers judge whether this particular venue would be helpful. But is the choice of venue really an obstacle? Everyone who cares is *already* welcome to browse the code (available on CVS, cvsweb, tarballs - would git help?), and critique it (e.g. on our open public mailing list or via bugzilla).

http://sourceware.org/systemtap/getinvolved.html

- FChE

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 0/13] maps: pagemap, kpagemap, and related cleanups
Christoph Hellwig <[EMAIL PROTECTED]> writes:

> and does a lot of really wrong things in its runtime).

I must agree with that. Perhaps it would be good if its runtime code was posted to l-k at some point and reviewed in the standard way even when it isn't merged.

-Andi
Re: [PATCH 0/13] maps: pagemap, kpagemap, and related cleanups
On Fri, Apr 13, 2007 at 10:08:27AM -0400, Theodore Tso wrote:
> On Fri, Apr 13, 2007 at 01:40:08PM +1000, Nick Piggin wrote:
> > With systemtap scripts, you could walk pagetables and print *the exact
> > page information you want*, or you could walk pfns, or LRU, or page_tree,
> > or walk the page tree then the rmap structures. And you can selectively
> > cull out items you don't care about if you only care about a subset of
> > items, based on arbitrary criteria. And you can most likely do all that
> > more efficiently than with a conglomeration of various /proc files
> > (assuming they even provide what you want in the first place).
>
> Yes, but maintaining the systemtap scripts will be a nightmare, since
> they would be outside the kernel, and as we change our internal data
> structure, the scripts would become useless.
>
> This is a fundamental problem with systemtap that we haven't been able
> to solve yet, because solving it would freeze various internal data
> structures or kernel functions. I agree that's not acceptable; which
> is why I don't think systemtap would be a good match for the problem
> we're trying to solve here.

It's also fundamentally not solvable. Even Sun doesn't guarantee that dtrace scripts are portable, because that would simply mean freezing all internals.

Of course systemtap management, with their executive visibility and plain stupidity of copying whatever Sun does, will never ever get it. This whole mess will only be solvable if IBM fires the right people in management.
Re: [PATCH 0/13] maps: pagemap, kpagemap, and related cleanups
On Fri, Apr 13, 2007 at 05:17:00PM -0400, Frank Ch. Eigler wrote:
> It may be worthwhile to remind people that it is easy to use systemtap
> only to the extent of automating the placement of kprobes: just to
> perform the function-name/source-file/line-number triplet to PC
> mapping. They can use embedded-C code to do all the same stuff they'd
> do with kprobes. They are not obligated to write any odd script code
> for probing logic, nor indeed use any of this really wrong runtime.

Umm, yes - as long as you write systemtap the runtime gets linked in currently. That doesn't mean you actually use a lot of it in the end, but the maintenance horror of actually getting all the junk code to compile is still there.

Now the actual function-name/source-file/line-number triplet to PC mapping is really useful functionality, and for my tracing work I could really use this a lot. Unfortunately systemtap doesn't have a proper layered approach and you can't use this bit without pulling in all the junk. I've started some dwarf-based function-name/source-file/line-number code of my own based on acme's work, but it's stalled due to more important issues going on.

> > We could not really distribute systemtap scripts with the kernel.
> > systemtap is a bloody complicated piece of [software]
>
> I don't know if that should be treated a compliment to our team, for
> being able to work quickly on something that a full-grown kernel
> developer finds bloody complicated. Perhaps your information is
> simply outdated. Big & bloated? We have several times asked for
> specifics rather than smears - what about it?

There's a lot of stuff unneeded for basic tasks. But if you want a detailed review you could submit your runtime bits for review and get feedback from everyone.

> > outside the kernel tree that breaks all the time we change kernel
> > internals. [...]
>
> That's begging the question. If kernel folks are willing to maintain
> some included systemtap-related code, then by definition it would not
> break all the time.

We'll definitely need a trace transport. I currently use a hackish kfifo ringbuffer derived from net/ipv4/tcp_probe.c, but it's showing its limitations. Tom promised long ago to factor out the trace code from blktrace into generic bits, but as he doesn't deliver I suspect I'll have to do that myself soon.

The safe dereference bits are a bit questionable, but at least worth a try to put into the tree proper, because there's no chance they'd be properly maintained outside.

The register dumps you do could definitely stand some integration with the register dumps in panic messages, and would be useful library functions for proper C language kprobes, but that means detangling the core from the utterly horrible systemtap pascal string handling.

Stack backtrace handling could use some integration with the stack tracing framework for lockdep and fault injection and be available more generically for C kprobes.

With a proper tracing infrastructure we'll need the timing bits for it as well, which should supersede the utter mess in systemtap in that area (I'm hoping for Matthew to come up with something there as part of lttng).
Re: [PATCH 0/13] maps: pagemap, kpagemap, and related cleanups
On Thu, Apr 12, 2007 at 09:23:45PM -0500, Matt Mackall wrote:
> On Fri, Apr 13, 2007 at 12:21:25PM +1000, Nick Piggin wrote:
> > Matt Mackall wrote:
> > >On Fri, Apr 13, 2007 at 11:42:29AM +1000, Nick Piggin wrote:
> >
> > >>If kprobes is simply crappy and doesn't work properly for this, then I
> > >>could accept that. I'm not someone trying to get this info. So why can't
> > >>it be used? (not just for kpagemap, but for clear_refs and all that gunk
> > >>too).
> > >
> > >kprobes is good for looking at events, but bad for looking at state.
> > >Especially metric shitloads of state.
> >
> > Why? Why is a kprobes trap significantly more expensive than a read
> > syscall?
>
> I guess I'm not clear on what you're proposing. From my understanding
> of kprobes (admittedly not an expert), this is hard to do and not a
> very good match.
>
> > >>Maybe. How about LRU? Reclaim performance is bad, and you want to work out
> > >>which pages keep going off the end of it, or which pages keep getting
> > >>written out via it, or whose pages are on the active list, forcing mine
> > >>out.
> > >
> > >Those are actually probably a good match for systemtap as they're all
> > >events.
> >
> > Traverse the LRU? Which files do they belong to? What process maps them?
>
> -ENOPARSE.

For non-event-based data gathering using kprobes, we can have a debugfs file like /debug/kprobes/snapshot_probe and write a kprobe module with a probe at its ->write() function. User space can then trigger the data collection with

echo "1" > /debug/kprobes/snapshot_probe

Thus, the actual data collection code can reside in a separate module or a systemtap script, which provides very good post-processing capabilities and can be used without recompiling or rebooting the kernel.

Thanks
Maneesh

--
Maneesh Soni
Linux Technology Center,
IBM India Systems and Technology Lab,
Bangalore, India
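A minimal sketch of the user-space half of that trigger, for anyone who would rather call it from a tool than from a shell. It only performs the write that `echo "1"` would do; the path is the one proposed above and exists only if some module creates it, and the helper name `trigger_snapshot` is made up for illustration.

```c
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/* Path proposed in the thread; hypothetical until a module provides it. */
#define SNAPSHOT_TRIGGER "/debug/kprobes/snapshot_probe"

/*
 * Write "1\n" to the trigger file, which is what would fire a probe placed
 * on the file's ->write() handler.
 * Returns 0 on success, -1 on failure (errno is left set by open/write/close).
 */
static int trigger_snapshot(const char *path)
{
    static const char cmd[] = "1\n";
    int fd, ret = 0;

    assert(path != NULL);

    fd = open(path, O_WRONLY);
    if (fd < 0)
        return -1;

    /* A short write is treated as failure; the command is only two bytes. */
    if (write(fd, cmd, sizeof(cmd) - 1) != (ssize_t)(sizeof(cmd) - 1))
        ret = -1;

    if (close(fd) < 0)
        ret = -1;

    return ret;
}
```

A caller would use `trigger_snapshot(SNAPSHOT_TRIGGER)` and then read the collected data from wherever the module or script publishes it, which the proposal leaves open.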
Re: [PATCH 0/13] maps: pagemap, kpagemap, and related cleanups
Christoph Hellwig <[EMAIL PROTECTED]> writes:

> [...]
> > merge it in the first place?
>
> It's very nice to poke deep into the kernel for development purposes.
> For example for the spu scheduler work I'm doing currently I have
> a module using kprobes (not the systemtap crap because it's big, bloated,
> in an odd language, and does a lot of really wrong things in its runtime).

It may be worthwhile to remind people that it is easy to use systemtap only to the extent of automating the placement of kprobes: just to perform the function-name/source-file/line-number triplet to PC mapping. They can use embedded-C code to do all the same stuff they'd do with kprobes. They are not obligated to write any odd script code for probing logic, nor indeed use any of this really wrong runtime.

> This module allows me to put probes into various places in the scheduler
> and writes them into a ringbuffer with timestamps allowing me to
> trace what's going on there. This is really neat. [...]

Indeed, and we too try to make this simple & fast: a couple of lines.

> [...] To summarize, I really love kprobes to ease my debugging work,
> but using it for any kind of production code is a total nightmare.

But at some point, some interface needs to be fixed for a final user-space tool. Whether that interface fixing is performed by kernel developers being more reluctant to rewrite basic things, or by providing a proc interface, or by maintaining a kprobes module does not matter. Someone will feel constrained, and someone will be liberated.

One neat thing about our systemtap tool is that, no matter what layer such interfaces become fixed within, it can probably interface to them. If there is no fixed interface, it can go down to debugging info. If there are tracing hooks present, it can attach. It can make the disparate standardization policies of different subsystems appear unified.

> > We could distribute some systemtap scripts, and even distribute some
> > basic useful ones like this sort of page info in the kernel source
> > tree.
>
> We could not really distribute systemtap scripts with the kernel.
> systemtap is a bloody complicated piece of [software]

I don't know if that should be treated as a compliment to our team, for being able to work quickly on something that a full-grown kernel developer finds bloody complicated. Perhaps your information is simply outdated. Big & bloated? We have several times asked for specifics rather than smears - what about it?

> outside the kernel tree that breaks all the time we change kernel
> internals. [...]

That's begging the question. If kernel folks are willing to maintain some included systemtap-related code, then by definition it would not break all the time.

- FChE
Re: [PATCH 0/13] maps: pagemap, kpagemap, and related cleanups
On Fri, 13 Apr 2007 12:24:51 -0500 Matt Mackall <[EMAIL PROTECTED]> wrote:

> > > From /proc/kpagemap + /proc/*/pagemap, you can
> > > basically synthesize any statistic you want, including all the
> > > existing ones. For some data, /proc/pid/smaps (or /proc/meminfo) will
> > > be considerably more efficient.
> >
> > You'd need to poke clear_refs beforehand to make the referenced bits useful.
> >
> > Actually, we also need to run around the ptes and collect the pte-referenced
> > bits too. I don't think your code copes with any of that?
>
> No, and it probably should. Perhaps dirty as well, though I've kindof
> lost the plot on how that works lately.

Dirty is OK: the VM keeps pte-dirtiness and page-dirtiness in sync now.
Re: [PATCH 0/13] maps: pagemap, kpagemap, and related cleanups
On Fri, Apr 13, 2007 at 10:03:56AM -0700, Andrew Morton wrote:
> On Fri, 13 Apr 2007 11:24:36 -0500 Matt Mackall <[EMAIL PROTECTED]> wrote:
>
> > > It *will* be viable. If the application wants to know if a page is dirty,
> > > it looks up "PG_dirty" in /proc/pg_foo-to-bitnumber and uses PG_dirty's
> > > numerical offset when inspecting fields in /proc/kpagemap. If correctly
> > > designed, such a monitoring application will be able to report upon page
> > > flags which we haven't even thought up yet.
> >
> > We can probably fit this in the existing (variable-sized) header.
>
> hm, OK..
>
> > > > I wonder what they are needed for.
> > >
> > > Poking deeply into the kernel to provide information about kernel state.
> > >
> > > There are real-world needs for this, and the people who develop tools to
> > > process this information will have decent kernel understanding and will
> > > know that the file's contents may alter across kernel versions. It sure
> > > beats poking around in /dev/kmem.
> > >
> > > I doubt if there's a sensible way in which we can prettify this interface
> > > without losing information. But we should aim to make it as robust as
> > > possible against future kernel changes, of course.
> > >
> > > And we should satisfy ourselves that all the required information has been
> > > made available. The fact that it will satisfy the Oracle requirement is
> > > encouraging.
> > >
> > > Matt, these changes make the new field in /proc/pid/smaps redundant, don't
> > > they?
> >
> > Which new field?
>
> Referenced:
>
> > From /proc/kpagemap + /proc/*/pagemap, you can
> > basically synthesize any statistic you want, including all the
> > existing ones. For some data, /proc/pid/smaps (or /proc/meminfo) will
> > be considerably more efficient.
>
> You'd need to poke clear_refs beforehand to make the referenced bits useful.
>
> Actually, we also need to run around the ptes and collect the pte-referenced
> bits too. I don't think your code copes with any of that?

No, and it probably should. Perhaps dirty as well, though I've kind of lost the plot on how that works lately.

> > But in general, most of the statistics in smaps are basically useless
> > for shared mappings, just like RSS. Problem is, we really don't know
> > what statistics we want yet, or even if it can be distilled down to
> > simple numbers anyway.
>
> yup. But that's the whole point, really: don't prejudge what info userspace
> is trying to collect.

Right.

--
Mathematics is the supreme nostalgia of our time.
Re: [PATCH 0/13] maps: pagemap, kpagemap, and related cleanups
On Fri, Apr 13, 2007 at 12:18:56PM +1000, Nick Piggin wrote:
> Can't you just traverse arbitrary kernel data structures at a given point
> in time, exactly like the /proc/ call is doing?

Perhaps. My understanding is that you hook a kprobe to an event. An event is a particular instruction getting executed. Indeed, you can do whatever poking around in the kernel you want at that point. And then you can stuff that data in a buffer that eventually gets to userspace.

This is very different from a read/seek/syscall. Rather than just asking the kernel for some data, we have to wait for the relevant events.

Now, of course, you can make an ugly hack like hooking sys_getpid() and basically make your own system call. Hopefully no one else will call getpid() while you're doing this, etc. Not really how it's intended to work at all, and probably a bitch to use, but possible.

Then the question becomes: why don't we do this for everything else in /proc? And the answer of course is: we put stuff in /proc because it's generally useful. Extra points if it's actually related to 'proc'esses.

Being able to tell what's paged in in a given mapping is useful. Being able to tell what's shared between two mappings is useful. Being able to get an accurate, meaningful picture of how your memory is being used is useful. Heck, I bet some people might find it useful to be able to see what nodes the pages in their process are on. All stuff you shouldn't need to be a kernel hacker to answer.

The flags part of /proc/kpagemap exposes some (very interesting!) implementation details. The rest of it is completely generic to any system with a VM. It's only deep kernel magic in the sense that it's not yet exposed.

--
Mathematics is the supreme nostalgia of our time.
Re: [PATCH 0/13] maps: pagemap, kpagemap, and related cleanups
On Fri, 13 Apr 2007 11:24:36 -0500 Matt Mackall <[EMAIL PROTECTED]> wrote:

> > It *will* be viable. If the application wants to know if a page is dirty,
> > it looks up "PG_dirty" in /proc/pg_foo-to-bitnumber and uses PG_dirty's
> > numerical offset when inspecting fields in /proc/kpagemap. If correctly
> > designed, such a monitoring application will be able to report upon page
> > flags which we haven't even thought up yet.
>
> We can probably fit this in the existing (variable-sized) header.

hm, OK..

> > > I wonder what they are needed for.
> >
> > Poking deeply into the kernel to provide information about kernel state.
> >
> > There are real-world needs for this, and the people who develop tools to
> > process this information will have decent kernel understanding and will
> > know that the file's contents may alter across kernel versions. It sure
> > beats poking around in /dev/kmem.
> >
> > I doubt if there's a sensible way in which we can prettify this interface
> > without losing information. But we should aim to make it as robust as
> > possible against future kernel changes, of course.
> >
> > And we should satisfy ourselves that all the required information has been
> > made available. The fact that it will satisfy the Oracle requirement is
> > encouraging.
> >
> > Matt, these changes make the new field in /proc/pid/smaps redundant, don't
> > they?
>
> Which new field?

Referenced:

> From /proc/kpagemap + /proc/*/pagemap, you can
> basically synthesize any statistic you want, including all the
> existing ones. For some data, /proc/pid/smaps (or /proc/meminfo) will
> be considerably more efficient.

You'd need to poke clear_refs beforehand to make the referenced bits useful.

Actually, we also need to run around the ptes and collect the pte-referenced bits too. I don't think your code copes with any of that?

> But in general, most of the statistics in smaps are basically useless
> for shared mappings, just like RSS. Problem is, we really don't know
> what statistics we want yet, or even if it can be distilled down to
> simple numbers anyway.

yup. But that's the whole point, really: don't prejudge what info userspace is trying to collect.
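The name-to-bit lookup described above might look roughly like this on the userspace side. This is only a sketch: /proc/pg_foo-to-bitnumber is a placeholder name in this thread and no format was specified, so the "NAME BITNUMBER" one-per-line layout is an assumption, as are the helper names `flag_bit` and `page_flag_set`. The table is passed in as a string so the parsing does not depend on any real file.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Look up a flag name in a table of "NAME BITNUMBER" lines (assumed format)
 * and return its bit number, or -1 if the name is not listed.
 */
static int flag_bit(const char *table, const char *name)
{
    char cur[64];
    int bit, used;

    assert(table != NULL && name != NULL);

    /* %n records how many characters were consumed so we can advance. */
    while (sscanf(table, " %63s %d%n", cur, &bit, &used) == 2) {
        if (strcmp(cur, name) == 0)
            return bit;
        table += used;
    }
    return -1;
}

/*
 * Test one page's raw flags word against a bit number taken from the table.
 * Unknown flags (bit < 0) and out-of-range bits read as "not set".
 */
static int page_flag_set(unsigned long flags, int bit)
{
    if (bit < 0 || bit >= (int)(sizeof(flags) * 8))
        return 0;
    return (int)((flags >> bit) & 1UL);
}
```

A monitoring tool would call `flag_bit(table, "PG_dirty")` once at startup and then apply `page_flag_set()` to each flags word it reads, so that it never hard-codes a bit position and can also list flags it has never heard of simply by walking the table.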
Re: [PATCH 0/13] maps: pagemap, kpagemap, and related cleanups
On Thu, Apr 12, 2007 at 05:42:01PM -0700, Andrew Morton wrote:
> On Fri, 13 Apr 2007 10:15:24 +1000 Nick Piggin <[EMAIL PROTECTED]> wrote:
>
> > >>+	((char *)page)[1] = PAGE_SHIFT;
> > >
> > > OK.
> >
> > Shouldn't we just expose page size and endianness by other means? (another
> > file or syscall).
>
> I don't think so - this file exposes fairly deep kernel internals and
> that's unavoidable, really - it's *supposed* to do that. It is explicitly
> designed for monitoring kernel behaviour.
>
> So it needs special handling by userspace. Keeping the number of files
> which need such special handling to a minimum will keep the number of
> applications which are exposed to kernel changes to a minimum.
>
> > >>+	for (; i < 2 * chunk / KPMSIZE; i += 2, pfn++) {
> > >>+		ppage = pfn_to_page(pfn);
> > >>+		if (!ppage) {
> > >>+			page[i] = 0;
> > >>+			page[i + 1] = 0;
> > >>+		} else {
> > >>+			page[i] = ppage->flags;
> > >>+			page[i + 1] = atomic_read(&ppage->_count);
> > >>+		}
> > >>+	}
> > >
> > > Not a good idea to expose raw flags in this manner - it changes at the
> > > drop of a hat. We'd need to also expose the kernel's PG_foo-to-bitnumber
> > > mapping to make this viable.
> >
> > I don't think it is viable because that makes the flags part of the
> > userspace ABI.
>
> It *will* be viable. If the application wants to know if a page is dirty,
> it looks up "PG_dirty" in /proc/pg_foo-to-bitnumber and uses PG_dirty's
> numerical offset when inspecting fields in /proc/kpagemap. If correctly
> designed, such a monitoring application will be able to report upon page
> flags which we haven't even thought up yet.

We can probably fit this in the existing (variable-sized) header.

> > I wonder what they are needed for.
>
> Poking deeply into the kernel to provide information about kernel state.
>
> There are real-world needs for this, and the people who develop tools to
> process this information will have decent kernel understanding and will
> know that the file's contents may alter across kernel versions. It sure
> beats poking around in /dev/kmem.
>
> I doubt if there's a sensible way in which we can prettify this interface
> without losing information. But we should aim to make it as robust as
> possible against future kernel changes, of course.
>
> And we should satisfy ourselves that all the required information has been
> made available. The fact that it will satisfy the Oracle requirement is
> encouraging.
>
> Matt, these changes make the new field in /proc/pid/smaps redundant, don't
> they?

Which new field? From /proc/kpagemap + /proc/*/pagemap, you can basically synthesize any statistic you want, including all the existing ones. For some data, /proc/pid/smaps (or /proc/meminfo) will be considerably more efficient.

But in general, most of the statistics in smaps are basically useless for shared mappings, just like RSS. Problem is, we really don't know what statistics we want yet, or even if it can be distilled down to simple numbers anyway.

--
Mathematics is the supreme nostalgia of our time.
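To make the "synthesize any statistic" point concrete, here is a rough sketch of how a tool might consume records shaped like the ones the quoted loop writes: two words per page frame, flags then count. Only those two facts and the `PAGE_SHIFT` byte at offset 1 come from the quoted patch fragments; the struct name, the helper names, and the assumption that the caller has already skipped the variable-sized header are illustrative, and the real file layout may differ.

```c
#include <assert.h>
#include <stddef.h>

/*
 * One record per page frame, mirroring the two words stored by the quoted
 * loop: page[i] = flags, page[i + 1] = reference count.
 */
struct kpm_entry {
    unsigned long flags;
    unsigned long count;
};

/*
 * Page size implied by a header whose second byte holds PAGE_SHIFT, as in
 * the quoted "((char *)page)[1] = PAGE_SHIFT" line. The meaning of the
 * other header bytes is not assumed here.
 */
static unsigned long kpm_page_size(const unsigned char *header)
{
    assert(header != NULL);
    return 1UL << header[1];
}

/*
 * Count how many of the n records have the given flag bit set. The bit
 * number would come from a name-to-bit table, never from a hard-coded value.
 */
static size_t kpm_count_flag(const struct kpm_entry *e, size_t n, unsigned bit)
{
    size_t i, hits = 0;

    assert(e != NULL || n == 0);
    assert(bit < sizeof(unsigned long) * 8);

    for (i = 0; i < n; i++)
        if ((e[i].flags >> bit) & 1UL)
            hits++;
    return hits;
}
```

Multiplying the result of `kpm_count_flag()` by `kpm_page_size()` gives a byte total for pages carrying that flag, which is the kind of number smaps precomputes; anything per-mapping would additionally need the pfn list from /proc/*/pagemap.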
Re: [PATCH 0/13] maps: pagemap, kpagemap, and related cleanups
On Fri, Apr 13, 2007 at 01:40:08PM +1000, Nick Piggin wrote: > With systemtap scripts, you could walk pagetables and print *the exact > page information you want*, or you could walk pfns, or LRU, or page_tree, > or walk the page tree then the rmap structures. And you can selectively > cull out items you don't care about if you only care about a subset of > items, based on arbitrary criteria. And you can most likely do all that > more efficiently than with a conglomeration of various /proc files > (assuming they even provide what you want in the first place). Yes, but maintaining the systemtap scripts will be a nightmare, since they would be outside the kernel, and as we change our internal data structure, the scripts would become useless. This is a fundamental problem with systemtap that we haven't been able to solve yet, because solving it would freeze various internal data structures or kernel functions. I agree that's not acceptable; which is why I don't think systemtap would be a good match for the problem we're trying to solve here. - Ted
Re: [PATCH 0/13] maps: pagemap, kpagemap, and related cleanups
Ananth N Mavinakayanahalli wrote:
> On Fri, Apr 13, 2007 at 12:50:20PM +1000, Nick Piggin wrote:
> > It definitely seems like you can use some kernel functions, but the ones I saw may just be systemtap facilities. But what is so surprising about being able to call a kernel function when running in kernel context? Perhaps there is some fundamental limitation of kprobes that I don't understand.
>
> The main requirement for kprobes handlers is that they can't sleep. You could definitely call a kernel function from kprobe handlers as long as the function doesn't sleep.

That would be enough to access basically all the VM data structures.

-- SUSE Labs, Novell Inc.
Re: [PATCH 0/13] maps: pagemap, kpagemap, and related cleanups
On Fri, Apr 13, 2007 at 12:54:36PM +1000, Nick Piggin wrote: > Matt Mackall wrote: > >On Fri, Apr 13, 2007 at 12:21:25PM +1000, Nick Piggin wrote: > > > >>Matt Mackall wrote: > >> > >>>On Fri, Apr 13, 2007 at 11:42:29AM +1000, Nick Piggin wrote: > >> > If kprobes is simply crappy and doesn't work properly for this, then I > could accept that. I'm not someone trying to get this info. So why can't > it be used? (not just for kpagemap, but for clear_refs and all that gunk > too). > >>> > >>> > >>>kprobes is good for looking at events, but bad for looking at state. > >>>Especially metric shitloads of state. > >> > >>Why? Why is a kprobes trap significantly more expensive than a read > >>syscall? > > > > > >I guess I'm not clear on what you're proposing. From my understanding > >of kprobes (admittedly not an expert), this is hard to do and not a > >very good match. > > But you have an idea that it is bad for exposing lots of data. Why? > (I'm not a kprobes expert either, these are not rhetorical questions) You could tie your kprobe module to use relay channels. Kprobe handlers run lockless and using the per-cpu relay channels will provide a fast transport mechanism for exposing lots of data. http://relayfs.sourceforge.net/examples.html#tprintk_kprobes is an example using the earlier relayfs interface. It shouldn't be that hard to change it to use the newer relay stuff. AFAIK acme is using a similar mechanism for ctracer (http://oops.ghostprotocols.net:81/blog/?p=50) Ananth
Re: [PATCH 0/13] maps: pagemap, kpagemap, and related cleanups
On Fri, Apr 13, 2007 at 12:50:20PM +1000, Nick Piggin wrote: > Andrew Morton wrote: > >On Fri, 13 Apr 2007 12:18:56 +1000 Nick Piggin <[EMAIL PROTECTED]> > >wrote: > > > > > >>>I guess one could generate an answer to the static question with > >>>systemtap, > >>>by accumulating running counts across the application lifetime and then > >>>snapshotting them. Sounds hard though. > >> > >>Can't you just traverse arbitrary kernel data structures at a given point > >>in time, exactly like the /proc/ call is doing? > > > > > >Do a full pagetable walk, with all the associated locking from within > >a systemtap script? I'd be surprised. Maybe if it's mostly hand-coded > >in C, perhaps. > > It looks like you can traverse arbitrary data structures, yes. > > It definitely seems like you can use some kernel functions, but the > ones I saw may just be systemtap facilities. But what is so surprising > about being able to call a kernel function when running in kernel > context? Perhaps there is some fundamental limitation of kprobes that > I don't understand. The main requirement for kprobes handlers is that they can't sleep. You could definitely call a kernel function from kprobe handlers as long as the function doesn't sleep. Ananth
Re: [PATCH 0/13] maps: pagemap, kpagemap, and related cleanups
On Fri, Apr 13, 2007 at 06:25:46PM +1000, Nick Piggin wrote: > But at least make it into its own module with a debugfs interface or > something. I mean, exposing a PG_name-to-nr and page count pfn and flags > as a supposedly formal proc interface doesn't sound nice to me. Page > flags does not tell you what is going on in the VM, it gives you a tiny > window into "something". Between reading a /proc/pid/ pfn and finding > the pfn's page flags, it could be used for something completely different > anyway. I agree that exposing numerical values of page flags is not a very good idea at all. If we really want to expose this information it should at least be in string form, although that is quite a bit of a maintenance horror as well.
Re: [PATCH 0/13] maps: pagemap, kpagemap, and related cleanups
Christoph Hellwig wrote:
> On Fri, Apr 13, 2007 at 06:03:45PM +1000, Nick Piggin wrote:
> > Yeah good point ;) I just meant the wider "we".
> >
> > With all the problems kprobes has, something like poking deep into kernel internals seems like a good thing to use it for instead of hardcoding that stuff into the kernel. If not, then why did we even merge it in the first place?
>
> It's very nice to poke deep into the kernel for development purposes. For example for the spu scheduler work I'm doing currently I have a module using kprobes (not the systemtap crap because it's big, bloated, in an odd language, and does a lot of really wrong things in its runtime).

OK, I pick systemtap because I don't know any better... but kprobes is what I mean for the kernel interface.

> This module allows me to put probes into various places in the scheduler and writes them into a ringbuffer with timestamps allowing me to trace what's going on there. This is really neat. Unfortunately it breaks as soon as I do some major reshuffling because then the points it hooks up to are not there anymore. That's perfectly fine for my setup, because _I_ know what I have to change when it breaks, and can easily fix that. Now imagine a similar module to trace pagecache activity used by a third-party monitoring tool. We now get a major change to the pagecache (say to make it lockless), and the probes just break. In the best case it just doesn't work anymore, in the worst case it crashes the kernel. Now if the app vendor at least gave me their source I could at least fix it to not crash, but there's a fair enough chance they poke into bits that simply aren't there anymore. Now if we have a proper user interface with real code behind it we can have a defined interface. If the interface is bad enough (or just too low-level) we might have the last problem of statistics that were there once going away as well, but we can deal with that gracefully by declaring parts of the stats volatile and make sure people don't mess with them. To summarize, I really love kprobes to ease my debugging work, but using it for any kind of production code is a total nightmare.

OK, well Matt's stuff that he needs doesn't have to be kprobes at all, and yes if lots of people want the same thing then we could distribute it with the kernel. But at least make it into its own module with a debugfs interface or something. I mean, exposing a PG_name-to-nr and page count pfn and flags as a supposedly formal proc interface doesn't sound nice to me. Page flags does not tell you what is going on in the VM, it gives you a tiny window into "something". Between reading a /proc/pid/ pfn and finding the pfn's page flags, it could be used for something completely different anyway.

> > We could distribute some systemtap scripts, and even distribute some basic useful ones like this sort of page info in the kernel source tree.
>
> We could not really distribute systemtap scripts with the kernel. systemtap is a bloody complicated piece of shit outside the kernel tree that breaks all the time we change kernel internals. We could provide useful kprobes modules, a proper tracing system (ltt-ng-lite) and surrounding infrastructure.

OK ;)

-- SUSE Labs, Novell Inc.
Re: [PATCH 0/13] maps: pagemap, kpagemap, and related cleanups
On Fri, Apr 13, 2007 at 08:51:42AM +0100, Christoph Hellwig wrote: > Umm, folks. systemtap basically means people compile kernel modules > from an odd scripting language with embedded C snippets into kernel > modules. The kernel modules don't use normal exported APIs but use > kallsyms and dwarf info to poke into every possible private bit. > Saying you don't care the slightest whether Oracle will load huge > amounts of code into the kernel that depends on intimate implementation > details, and that you don't even have source to debug it is not what > I'd call "none of us need to care in the slightest", at least for those > of you working for distributions that may actually have to debug this > shit in the end. > While we're at it, what happened to the idea of tainting the kernel > as soon as kprobes are placed in the kernel to at least make people > aware of it? This is for a system monitoring app outside the database proper that just happens to be done by the same .com as makes the database. It's got little to do with the database itself apart from knowing how to tell the database to e.g. let fewer clients in. The part that deals with this is sort of like a custom procps that does things rather specifically how they need them, including being portable to other OS's IIRC, though the scope of the app is much larger than that. They're actually quite concerned about issues of this sort since they want to run all the time in the background in order to respond to system conditions, though probably not necessarily rapid-fire sorts of responses to second-by-second changes in conditions. Basically, they're not a debugging affair, and they need to be able to run in supported conditions. They're rather disinterested in things that would, say, taint the kernel or take customers out of supported configurations. They'll fall back to the known-grossly-inaccurate RSS-based estimates they're using now in preference to such.
They don't want omnibus back doors into the kernel and I honestly expect them to NAK the systemtap solution. They really want the "uniquely attributable RSS" or "proportional RSS" reported directly, and it takes some doing to convince them that this can't be done directly for various reasons, e.g. floating point in the kernel won't fly. They can program sufficiently well to maintain a database of pfn's, pid's of processes mapping them, and user virtual addresses they're mapped at (easy enough to kick off a database instance for it if they don't feel comfortable just hashing the triples) and do the tabulation from there, though they're not happy having to do so much of the calculation themselves. Actually, I promised them reporting of mapcount which would make per-process UARSS/PRSS calculation able to be done on a process-by-process basis, though I can probably convince them to do whole-system pfn-by-pfn tabulation if such is lacking. -- wli
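The per-process tabulation from mapcounts mentioned above can be sketched in integer arithmetic as follows. The thread does not define the two statistics precisely, so this assumes the usual reading: "uniquely attributable RSS" counts resident pages whose mapcount is 1, and "proportional RSS" charges each resident page as 1/mapcount of a page. The struct, the function name `rss_tabulate` and the fixed-point shift are all hypothetical choices made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PRSS_SHIFT 12  /* fixed-point fraction bits; an arbitrary choice */

struct rss_stats {
    uint64_t unique_pages;  /* pages mapped exactly once (UARSS, in pages) */
    uint64_t prss_fixed;    /* proportional share, in 1/2^PRSS_SHIFT pages */
};

/*
 * mapcounts[i] is the mapcount of the i-th page resident in one process,
 * as a tool might collect it by joining per-process pfns with per-pfn
 * mapcounts. Integer division truncates, so each page's share is rounded
 * down to the nearest 1/2^PRSS_SHIFT of a page.
 */
static struct rss_stats rss_tabulate(const unsigned int *mapcounts, size_t n)
{
    struct rss_stats s = { 0, 0 };
    size_t i;

    assert(mapcounts != NULL || n == 0);
    for (i = 0; i < n; i++) {
        if (mapcounts[i] == 0)
            continue;               /* not mapped: contributes nothing */
        if (mapcounts[i] == 1)
            s.unique_pages++;
        s.prss_fixed += ((uint64_t)1 << PRSS_SHIFT) / mapcounts[i];
    }
    return s;
}
```

For mapcounts {1, 1, 2, 4} this gives 2 unique pages and a proportional share of 2.75 pages (11264 in 1/4096-page units), with no floating point involved. Without per-page mapcounts the same numbers would need the whole-system pfn-by-pfn tabulation described above.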
Re: [PATCH 0/13] maps: pagemap, kpagemap, and related cleanups
On Fri, Apr 13, 2007 at 06:03:45PM +1000, Nick Piggin wrote: > Yeah good point ;) I just meant the wider "we". > > With all the problems kprobes has, something like poking deep into > kernel internals seems like a good thing to use it for instead of > hardcoding that stuff into the kernel. If not, then why did we even > merge it in the first place? It's very nice to poke deep into the kernel for development purposes. For example for the spu scheduler work I'm doing currently I have a module using kprobes (not the systemtap crap because it's big, bloated, in an odd language, and does a lot of really wrong things in its runtime). This module allows me to put probes into various places in the scheduler and writes them into a ringbuffer with timestamps allowing me to trace what's going on there. This is really neat. Unfortunately it breaks as soon as I do some major reshuffling because then the points it hooks up to are not there anymore. That's perfectly fine for my setup, because _I_ know what I have to change when it breaks, and can easily fix that. Now imagine a similar module to trace pagecache activity used by a third-party monitoring tool. We now get a major change to the pagecache (say to make it lockless), and the probes just break. In the best case it just doesn't work anymore, in the worst case it crashes the kernel. Now if the app vendor at least gave me their source I could at least fix it to not crash, but there's a fair enough chance they poke into bits that simply aren't there anymore. Now if we have a proper user interface with real code behind it we can have a defined interface. If the interface is bad enough (or just too low-level) we might have the last problem of statistics that were there once going away as well, but we can deal with that gracefully by declaring parts of the stats volatile and make sure people don't mess with them.
To summarize, I really love kprobes to ease my debugging work, but using it for any kind of production code is a total nightmare. > We could distribute some systemtap scripts, and even distribute some > basic useful ones like this sort of page info in the kernel source > tree. We could not really distribute systemtap scripts with the kernel. systemtap is a bloody complicated piece of shit outside the kernel tree that breaks all the time we change kernel internals. We could provide useful kprobes modules, a proper tracing system (ltt-ng-lite) and surrounding infrastructure.
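The "ringbuffer with timestamps" idea mentioned above can be modelled in plain user-space C as below. This is not the module described in the message, whose source is not shown in the thread; it is a single-threaded sketch with hypothetical names (`trace_ring`, `trace_record`, `trace_drain`). A real kprobe handler would additionally need per-CPU buffers or some other synchronization, and would take its timestamp from a kernel clock rather than from the caller.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TRACE_SLOTS 8   /* must be a power of two for the index mask */

struct trace_event {
    uint64_t timestamp;  /* caller-supplied, e.g. a cycle counter value */
    uint32_t probe_id;   /* which probe point fired */
};

struct trace_ring {
    struct trace_event slot[TRACE_SLOTS];
    uint64_t head;       /* total number of events ever recorded */
};

/* Record one event, overwriting the oldest once the ring is full. */
static void trace_record(struct trace_ring *r, uint64_t ts, uint32_t probe_id)
{
    struct trace_event *e;

    assert(r != NULL);
    e = &r->slot[r->head & (TRACE_SLOTS - 1)];
    e->timestamp = ts;
    e->probe_id = probe_id;
    r->head++;
}

/* Copy the surviving events, oldest first, into out[]; returns the count. */
static size_t trace_drain(const struct trace_ring *r, struct trace_event *out)
{
    uint64_t n = r->head < TRACE_SLOTS ? r->head : TRACE_SLOTS;
    uint64_t first = r->head - n;
    uint64_t i;

    for (i = 0; i < n; i++)
        out[i] = r->slot[(first + i) & (TRACE_SLOTS - 1)];
    return (size_t)n;
}
```

The buffer itself is independent of where the probes are attached; the fragility discussed above lies in the probe points, which disappear when the probed code is reshuffled.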
Re: [PATCH 0/13] maps: pagemap, kpagemap, and related cleanups
Christoph Hellwig wrote:
> On Fri, Apr 13, 2007 at 05:05:47PM +1000, Nick Piggin wrote:
> > Ah, OK. Anyway, with kprobes/systemtap they can do whatever they like and none of us need to care in the slightest ;)
>
> Umm, folks. systemtap basically means people compile kernel modules from an odd scripting language with embedded C snippets into kernel modules. The kernel modules don't use normal exported APIs but use kallsyms and dwarf info to poke into every possible private bit. Saying you don't care the slightest whether Oracle will load huge amounts of code into the kernel that depends on intimate implementation details, and that you don't even have source to debug it is not what I'd call "none of us need to care in the slightest", at least for those of you working for distributions that may actually have to debug this shit in the end.

Yeah good point ;) I just meant the wider "we".

With all the problems kprobes has, something like poking deep into kernel internals seems like a good thing to use it for instead of hardcoding that stuff into the kernel. If not, then why did we even merge it in the first place?

We could distribute some systemtap scripts, and even distribute some basic useful ones like this sort of page info in the kernel source tree.

> While we're at it, what happened to the idea of tainting the kernel as soon as kprobes are placed in the kernel to at least make people aware of it?

Seems like a very good idea.

-- SUSE Labs, Novell Inc.
Re: [PATCH 0/13] maps: pagemap, kpagemap, and related cleanups
On Fri, Apr 13, 2007 at 05:05:47PM +1000, Nick Piggin wrote: > Ah, OK. Anyway, with kprobes/systemtap they can do whatever they like > and none of us need to care in the slightest ;) Umm, folks. systemtap basically means people compile kernel modules from an odd scripting language with embedded C snippets into kernel modules. The kernel modules don't use normal exported APIs but use kallsyms and dwarf info to poke into every possible private bit. Saying you don't care the slightest whether Oracle will load huge amounts of code into the kernel that depends on intimate implementation details, and that you don't even have source to debug it is not what I'd call "none of us need to care in the slightest", at least for those of you working for distributions that may actually have to debug this shit in the end. While we're at it, what happened to the idea of tainting the kernel as soon as kprobes are placed in the kernel to at least make people aware of it?
Re: [PATCH 0/13] maps: pagemap, kpagemap, and related cleanups
William Lee Irwin III wrote: >> The EM guys are unwilling or unable for support-oriented reasons to >> deal with anything but unmodified kernels as shipped by distros. On Fri, Apr 13, 2007 at 05:03:43PM +1000, Nick Piggin wrote: > And I think major distros ship with kprobes enabled, so that is yet > another reason why systemtap should be considered before adding these > proc interfaces. I'll have to check in and see if that will work for them. A lot of this is about customer/distro/support interaction constraints on how it works as opposed to purely technical affairs. -- wli
Re: [PATCH 0/13] maps: pagemap, kpagemap, and related cleanups
William Lee Irwin III wrote:
> Andrew Morton wrote:
> >> Then you just end up with the same thing, don't you?
>
> On Fri, Apr 13, 2007 at 12:50:20PM +1000, Nick Piggin wrote:
> > Well _you_ do, because that happens to be exactly what you want. Bill ends up with something that displays page_mapcount instead. And I end up with something that traverses LRU lists rather than pfns. And none of it goes in /proc/ or linux-2.6/.
> > So it isn't really the same thing at all.
>
> The EM guys aren't dealing with the database; they're dealing with some enterprise management thingie that does things like control how many client connections are allowed for each database instance. Unless they're doing less than I expect, and are largely something like procps on steroids and enterprise silliness.

Ah, OK. Anyway, with kprobes/systemtap they can do whatever they like and none of us need to care in the slightest ;)

-- SUSE Labs, Novell Inc.
Re: [PATCH 0/13] maps: pagemap, kpagemap, and related cleanups
William Lee Irwin III wrote:
> Andrew Morton wrote:
> >> Do a full pagetable walk, with all the associated locking from within a systemtap script? I'd be surprised. Maybe if it's mostly hand-coded in C, perhaps.
> >> Then you just end up with the same thing, don't you?
>
> On Fri, Apr 13, 2007 at 01:40:08PM +1000, Nick Piggin wrote:
> > And my problem isn't with the hardcoded pagetable walker. Yeah, we'd probably still keep the pagetable callback walker thingy with Matt's associated cleanups (and my subsequent ones to clean it up more and move it to mm/): there are other in-kernel users for that anyway.
> > The point is the proc API, and exposing random little parts of deep kernel internals that some people happen to find useful at the time. (which is why we have an incredible proliferation of these things).
> > With systemtap scripts, you could walk pagetables and print *the exact page information you want*, or you could walk pfns, or LRU, or page_tree, or walk the page tree then the rmap structures. And you can selectively cull out items you don't care about if you only care about a subset of items, based on arbitrary criteria. And you can most likely do all that more efficiently than with a conglomeration of various /proc files (assuming they even provide what you want in the first place).
>
> The EM guys are unwilling or unable for support-oriented reasons to deal with anything but unmodified kernels as shipped by distros.

And I think major distros ship with kprobes enabled, so that is yet another reason why systemtap should be considered before adding these proc interfaces.

Thanks, Nick

-- SUSE Labs, Novell Inc.
Re: [PATCH 0/13] maps: pagemap, kpagemap, and related cleanups
Andrew Morton wrote: >> Do a full pagetable walk, with all the associated locking from within >> a systemtap script? I'd be surprised. Maybe if it's mostly hand-coded >> in C, perhaps. >> Then you just end up with the same thing, don't you? On Fri, Apr 13, 2007 at 01:40:08PM +1000, Nick Piggin wrote: > And my problem isn't with the hardcoded pagetable walker. Yeah, we'd > probably still keep the pagetable callback walker thingy with Matt's > associated cleanups (and my subsequent ones to clean it up more and > move it to mm/): there are other in-kernel users for that anyway. > The point is the proc API, and exposing random little parts of deep > kernel internals that some people happen to find useful at the time. > (which is why we have an incredible proliferation of these things). > With systemtap scripts, you could walk pagetables and print *the exact > page information you want*, or you could walk pfns, or LRU, or page_tree, > or walk the page tree then the rmap structures. And you can selectively > cull out items you don't care about if you only care about a subset of > items, based on arbitrary criteria. And you can most likely do all that > more efficiently than with a conglomeration of various /proc files > (assuming they even provide what you want in the first place). The EM guys are unwilling or unable for support-oriented reasons to deal with anything but unmodified kernels as shipped by distros. -- wli
Re: [PATCH 0/13] maps: pagemap, kpagemap, and related cleanups
Andrew Morton wrote: >> Then you just end up with the same thing, don't you? On Fri, Apr 13, 2007 at 12:50:20PM +1000, Nick Piggin wrote: > Well _you_ do, because that happens to be exactly what you want. Bill > ends up with something that displays page_mapcount instead. And I > end up with something that traverses LRU lists rather than pfns. And > none of it goes in /proc/ or linux-2.6/. > So it isn't really the same thing at all. The EM guys aren't dealing with the database; they're dealing with some enterprise management thingie that does things like control how many client connections are allowed for each database instance. Unless they're doing less than I expect, and are largely something like procps on steroids and enterprise silliness. -- wli
Re: [PATCH 0/13] maps: pagemap, kpagemap, and related cleanups
Andrew Morton wrote: Then you just end up with the same thing, don't you? On Fri, Apr 13, 2007 at 12:50:20PM +1000, Nick Piggin wrote: Well _you_ do, because that happens to be exactly what you want. Bill ends up with something that displays page_mapcount instead. And I end up with something that traverses LRU lists rather than pfns. And none of it goes in /proc/ or linux-2.6/. So it isn't really the same thing at all. The EM guys aren't dealing with the database; they're dealing with some enterprise management thingie that does things like control how many client connections are allowed for each database instance. Unless they're doing less than I expect, and are largely something like procps on steroids and enterprise silliness. -- wli - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 0/13] maps: pagemap, kpagemap, and related cleanups
Andrew Morton wrote: Do a full pagetable walk, with all the associated locking from within a systemtap script? I'd be surprised. Maybe if it's mostly hand-coded in C, perhaps. Then you just end up with the same thing, don't you? On Fri, Apr 13, 2007 at 01:40:08PM +1000, Nick Piggin wrote: And my problem isn't with the hardcoded pagetable walker. Yeah, we'd probably still keep the pagetable callback walker thingy with Matt's associated cleanups (and my subsequent ones to clean it up more and move it to mm/): there are other in-kernel users for that anyway. The point is the proc API, and exposing random little parts of deep kernel internals that some people happen to find useful at the time. (which is why we have an incredible proliferation of these things). With systemtap scripts, you could walk pagetables and print *the exact page information you want*, or you could walk pfns, or LRU, or page_tree, or walk the page tree then the rmap structures. And you can selectively cull out items you don't care about if you only care about a subset of items, based on arbitrary criteria. And you can most likely do all that more efficiently than with a conglomeration of various /proc files (assuming they even provide what you want in the first place). The EM guys are unwilling or unable for support-oriented reasons to deal with anything but unmodified kernels as shipped by distros. -- wli - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 0/13] maps: pagemap, kpagemap, and related cleanups
William Lee Irwin III wrote: Andrew Morton wrote: Do a full pagetable walk, with all the associated locking from within a systemtap script? I'd be surprised. Maybe if it's mostly hand-coded in C, perhaps. Then you just end up with the same thing, don't you? On Fri, Apr 13, 2007 at 01:40:08PM +1000, Nick Piggin wrote: And my problem isn't with the hardcoded pagetable walker. Yeah, we'd probably still keep the pagetable callback walker thingy with Matt's associated cleanups (and my subsequent ones to clean it up more and move it to mm/): there are other in-kernel users for that anyway. The point is the proc API, and exposing random little parts of deep kernel internals that some people happen to find useful at the time. (which is why we have an incredible proliferation of these things). With systemtap scripts, you could walk pagetables and print *the exact page information you want*, or you could walk pfns, or LRU, or page_tree, or walk the page tree then the rmap structures. And you can selectively cull out items you don't care about if you only care about a subset of items, based on arbitrary criteria. And you can most likely do all that more efficiently than with a conglomeration of various /proc files (assuming they even provide what you want in the first place). The EM guys are unwilling or unable for support-oriented reasons to deal with anything but unmodified kernels as shipped by distros. And I think major distros ship with kprobes enabled, so that is yet another reason why systemtap should be considered before adding these proc interfaces. Thanks, Nick -- SUSE Labs, Novell Inc. - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 0/13] maps: pagemap, kpagemap, and related cleanups
William Lee Irwin III wrote: Andrew Morton wrote: Then you just end up with the same thing, don't you? On Fri, Apr 13, 2007 at 12:50:20PM +1000, Nick Piggin wrote: Well _you_ do, because that happens to be exactly what you want. Bill ends up with something that displays page_mapcount instead. And I end up with something that traverses LRU lists rather than pfns. And none of it goes in /proc/ or linux-2.6/. So it isn't really the same thing at all. The EM guys aren't dealing with the database; they're dealing with some enterprise management thingie that does things like control how many client connections are allowed for each database instance. Unless they're doing less than I expect, and are largely something like procps on steroids and enterprise silliness. Ah, OK. Anyway, with kprobes/systemtap they can do whatever they like and none of us need to care in the slightest ;) -- SUSE Labs, Novell Inc. - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 0/13] maps: pagemap, kpagemap, and related cleanups
William Lee Irwin III wrote: The EM guys are unwilling or unable for support-oriented reasons to deal with anything but unmodified kernels as shipped by distros. On Fri, Apr 13, 2007 at 05:03:43PM +1000, Nick Piggin wrote: And I think major distros ship with kprobes enabled, so that is yet another reason why systemtap should be considered before adding these proc interfaces. I'll have to check in and see if that will work for them. A lot of this is about customer/distro/support interaction constraints on how it works as opposed to purely technical affairs. -- wli - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 0/13] maps: pagemap, kpagemap, and related cleanups
On Fri, Apr 13, 2007 at 05:05:47PM +1000, Nick Piggin wrote:
> Ah, OK. Anyway, with kprobes/systemtap they can do whatever they like and none of us need to care in the slightest ;)

Umm, folks. systemtap basically means people compile an odd scripting language with embedded C snippets into kernel modules. Those modules don't use the normal exported APIs; they use kallsyms and DWARF info to poke into every possible private bit. Saying you don't care in the slightest whether Oracle will load huge amounts of code into the kernel that depends on intimate implementation details, and that you don't even have the source to debug, is not what I'd call "none of us need to care in the slightest", at least for those of you working for distributions that may actually have to debug this shit in the end.

While we're at it, what happened to the idea of tainting the kernel as soon as kprobes are placed in the kernel, to at least make people aware of it?
Re: [PATCH 0/13] maps: pagemap, kpagemap, and related cleanups
Christoph Hellwig wrote:
> On Fri, Apr 13, 2007 at 05:05:47PM +1000, Nick Piggin wrote:
> > Ah, OK. Anyway, with kprobes/systemtap they can do whatever they like and none of us need to care in the slightest ;)
>
> Umm, folks. systemtap basically means people compile an odd scripting language with embedded C snippets into kernel modules. Those modules don't use the normal exported APIs; they use kallsyms and DWARF info to poke into every possible private bit. Saying you don't care in the slightest whether Oracle will load huge amounts of code into the kernel that depends on intimate implementation details, and that you don't even have the source to debug, is not what I'd call "none of us need to care in the slightest", at least for those of you working for distributions that may actually have to debug this shit in the end.

Yeah good point ;) I just meant the wider "we". With all the problems kprobes has, something like poking deep into kernel internals seems like a good thing to use it for instead of hardcoding that stuff into the kernel. If not, then why did we even merge it in the first place? We could distribute some systemtap scripts, and even distribute some basic useful ones like this sort of page info in the kernel source tree.

> While we're at it, what happened to the idea of tainting the kernel as soon as kprobes are placed in the kernel, to at least make people aware of it?

Seems like a very good idea.

-- SUSE Labs, Novell Inc.
Re: [PATCH 0/13] maps: pagemap, kpagemap, and related cleanups
On Fri, Apr 13, 2007 at 06:03:45PM +1000, Nick Piggin wrote:
> Yeah good point ;) I just meant the wider we. With all the problems kprobes has, something like poking deep into kernel internals seems like a good thing to use it for instead of hardcoding that stuff into the kernel. If not, then why did we even merge it in the first place?

It's very nice to poke deep into the kernel for development purposes. For example, for the spu scheduler work I'm currently doing I have a module using kprobes (not the systemtap crap, because it's big, bloated, in an odd language, and does a lot of really wrong things in its runtime). This module lets me put probes into various places in the scheduler and writes the events into a ringbuffer with timestamps, allowing me to trace what's going on there. This is really neat. Unfortunately it breaks as soon as I do some major reshuffling, because then the points it hooks into are not there anymore. That's perfectly fine for my setup, because _I_ know what I have to change when it breaks, and can easily fix that.

Now imagine a similar module to trace pagecache activity used by a third-party monitoring tool. We now get a major change to the pagecache (say to make it lockless), and the probes just break. In the best case it just doesn't work anymore; in the worst case it crashes the kernel. Now if the app vendor at least gave me their source I could at least fix it to not crash, but there's a fair chance they poke into bits that simply aren't there anymore.

Now if we have a proper user interface with real code behind it, we can have a defined interface. If the interface is bad enough (or just too low-level) we might still have the latter problem of statistics that were there once going away as well, but we can deal with that gracefully by declaring parts of the stats volatile and making sure people don't mess with them.

To summarize, I really love kprobes to ease my debugging work, but using it for any kind of production code is a total nightmare.

> We could distribute some systemtap scripts, and even distribute some basic useful ones like this sort of page info in the kernel source tree.

We could not really distribute systemtap scripts with the kernel. systemtap is a bloody complicated piece of shit outside the kernel tree that breaks all the time we change kernel internals. We could provide useful kprobes modules, a proper tracing system (ltt-ng-lite) and surrounding infrastructure.
Re: [PATCH 0/13] maps: pagemap, kpagemap, and related cleanups
On Fri, Apr 13, 2007 at 08:51:42AM +0100, Christoph Hellwig wrote: Umm, folks. systemtap basically means people compile kernel modules from an odd scripting language with embedded C snipplets into kernel modules. The kernel modules don't use normal exported APIs but use kallsysms and dwarf info to poke into every possible private bit. Saying you don't care the slightest whether oracle will load huge amounts of code into the kernel that depends on intimate implementation details, and that you don't even have source to to debug it is not what I'd call none of us need to care in the slightest, at least for those of you working for distributions that may actually have to debug this shit in the end. While we're at it, what happened to the idea of tainting the kernel as soon as krpobes are placed in the kernel to at least make people aware of it? This is for a system monitoring app outside the database proper that just happens to be done by the same .com as makes the database. It's got little to do with the database itself apart from knowing how to tell the database to e.g. let fewer clients in. The part that deals with this is sort of like a custom procps that does things rather specifically how they need them, including being portable to other OS's IIRC, though the scope of the app is much larger than that. They're actually quite concerned about issues of this sort since they want to run all the time in the background in order to respond to system conditions, though probably not necessarily rapid-fire sorts of responses to second-by-second changes in conditions. Basically, they're not a debugging affair, and they need to be able to run in supported conditions. They're rather disinterested in things that would, say, taint the kernel or take customers out of supported configurations. They'll fall back to the known-grossly-inaccurate RSS-based estimates they're using now in preference to such. 
They don't want omnibus back doors into the kernel and I honestly expect them to NAK the systemtap solution. They really want the uniquely attributable RSS or proportional RSS reported directly, and it takes some doing to convince them that this can't be done directly for various reasons, e.g. floating point in the kernel won't fly. They can program sufficiently well to maintain a database of pfn's, pid's of processes mapping them, and user virtual addresses they're mapped at (easy enough to kick off a database instance for it if they don't feel comfortable just hashing the triples) and do the tabulation from there, though they're not happy having to do so much of the calculation themselves. Actually, I promised them reporting of mapcount which would make per-process UARSS/PRSS calculation able to be done on a process-by-process basis, though I can probably convince them to do whole-system pfn-by-pfn tabulation if such is lacking. -- wli - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
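The userspace tabulation wli describes above can be sketched roughly as follows. This is only an illustration under stated assumptions: the (pid, vaddr, pfn) triples are presumed to have been collected already, the function name `tabulate_rss` and the 4096-byte page size are made up for the example, and "uniquely attributable" is taken to mean pages mapped by exactly one process. Exact fractions are used, so no floating point is involved.

```python
from collections import Counter, defaultdict
from fractions import Fraction


def tabulate_rss(triples, page_size=4096):
    """Return {pid: (uarss_bytes, prss_bytes)} from (pid, vaddr, pfn) triples.

    Hypothetical helper for illustration; not taken from any posted patch.
    """
    triples = list(triples)  # we iterate more than once

    # Number of mappings referencing each physical page (its mapcount).
    mapcount = Counter(pfn for _pid, _vaddr, pfn in triples)

    # Which processes map each physical page.
    owners = defaultdict(set)
    for pid, _vaddr, pfn in triples:
        owners[pfn].add(pid)

    # Proportional RSS: each mapping is charged page_size / mapcount.
    prss = defaultdict(Fraction)
    for pid, _vaddr, pfn in triples:
        prss[pid] += Fraction(page_size, mapcount[pfn])

    # Uniquely attributable RSS: pages mapped by exactly one process.
    uarss = {pid: 0 for pid in prss}
    for pfn, pids in owners.items():
        if len(pids) == 1:
            (pid,) = pids
            uarss[pid] += page_size

    return {pid: (uarss[pid], prss[pid]) for pid in prss}
```

A useful sanity check on any such tabulation is that the proportional figures summed over all processes equal the number of distinct mapped pages times the page size.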
Re: [PATCH 0/13] maps: pagemap, kpagemap, and related cleanups
Christoph Hellwig wrote: On Fri, Apr 13, 2007 at 06:03:45PM +1000, Nick Piggin wrote: Yeah good point ;) I just meant the wider we. With all the problems kprobes has, something like poking deep into kernel internals seems like a good thing to use it for instead of hardcoding that stuff into the kernel. If not, then why did we even merge it in the first place? It's very nice to poke deep into the kernel for development purposes. For example for the spu scheduler work I'm doing currently I have a module using kprobes (note the systemtap crap because it's big, bloated, in and odd language, and does a lot of really wrong things in it's runtime). OK, I pick systemtap because I don't know any better... but kprobes is what I mean for the kernel interface. This module allows me to put probes into various places in the scheduler and writes them into a ringbuffer with timestampts allowing me to trace what's going on there. This is really neat. Unfortunately it breaks as soon as I do some major reshuffling because then the points it hooks up to are not there anymore. That's perfectly fine for my setup, because _I_ know what I have to change when it breaks, and can easily fix that. Now imagine a similar module to trace pagecache activity used by a third-party monitoring tool. We now get a major change to the pagecache (say to make it lockless), and the probes just break. In the est case it just doesn't work anymore, in the worst case it crashes the kernel. Now if the app vendor at least gave me their source I could at least fix it to not crash, but there's a fair enough chance they poke into bits that simply aren't there anymore. Now if we have a proper user interface with real code behing it we can have a defined interface. 
If the interface is bad enough (or just too lowlevel) we might have the last problem of stastistic that were there once to go away aswell, but we can deal with that gracefully by declaring parts of the stats volatile and make sure people don't mess with them. To summarize, I really love kprobes to ease my debugging work, but using it for any kind of production code is a total nightmare. OK, well Matt's stuff that he needs doesn't have to be kprobes at all, and yes if lots of people want the same thing then we could distribute it with the kernel. But at least make it into its own module with a debugfs interface or something. I mean, exposing a PG_name-to-nr and page count pfn and flags as a supposedly formal proc interface doesn't sound nice to me. Page flags does not tell you what is going on in the VM, it gives you a tiny window into something. Between reading a /proc/pid/ pfn and finding the pfn's page flags, it could be used for something completely different anyway. We could distribute some systemtap scripts, and even distribute some basic useful ones like this sort of page info in the kernel source tree. We could not really distribute systemtap scripts with the kernel. systemtap is a bloody complicated piece of shit outside the kernel tree that breaks all the time we change kernel internals. We could provide useful kprobes modules, a proper tracing system (ltt-ng-lite) and surrounding infrastucture. OK ;) -- SUSE Labs, Novell Inc. - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 0/13] maps: pagemap, kpagemap, and related cleanups
On Fri, Apr 13, 2007 at 06:25:46PM +1000, Nick Piggin wrote:
> But at least make it into its own module with a debugfs interface or something. I mean, exposing a PG_name-to-nr and page count pfn and flags as a supposedly formal proc interface doesn't sound nice to me. Page flags do not tell you what is going on in the VM; they give you a tiny window into something. Between reading a /proc/pid/ pfn and finding the pfn's page flags, it could be used for something completely different anyway.

I agree that exposing numerical values of page flags is not a very good idea at all. If we really want to expose this information it should at least be in string form, although that is quite a bit of a maintenance horror as well.
Re: [PATCH 0/13] maps: pagemap, kpagemap, and related cleanups
On Fri, Apr 13, 2007 at 12:50:20PM +1000, Nick Piggin wrote: Andrew Morton wrote: On Fri, 13 Apr 2007 12:18:56 +1000 Nick Piggin [EMAIL PROTECTED] wrote: I guess one could generate an answer to the static question with systemtap, by accumulating running counts across the application lifetime and then snapshotting them. Sounds hard though. Can't you just traverse arbitrary kernel data structures at a given point in time, exactly like the /proc/ call is doing? Do a full pagetable walk, with all the associated locking from within a systemtap script? I'd be surprised. Maybe if it's mostly hand-coded in C, perhaps. It looks like you can traverse arbitrary data structures, yes. It definitely seems like you can use some kernel functions, but the ones I saw may just be systemtap facilities. But what is so surprising about being able to call a kernel function when running in kernel context? Perhaps there is some fundamental limitation of kprobes that I don't understand. The main requirement for kprobes handlers is that they can't sleep. You could definitely call a kernel function from kprobe handlers as long as the function doesn't sleep. Ananth - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 0/13] maps: pagemap, kpagemap, and related cleanups
On Fri, Apr 13, 2007 at 12:54:36PM +1000, Nick Piggin wrote: Matt Mackall wrote: On Fri, Apr 13, 2007 at 12:21:25PM +1000, Nick Piggin wrote: Matt Mackall wrote: On Fri, Apr 13, 2007 at 11:42:29AM +1000, Nick Piggin wrote: If kprobes is simply crappy and doesn't work properly for this, then I could accept that. I'm not someone trying to get this info. So why can't it be used? (not just for kpagemap, but for clear_refs and all that gunk too). kprobes is good for looking at events, but bad for looking at state. Especially metric shitloads of state. Why? Why is a kprobes trap significantly more expensive than a read syscall? I guess I'm not clear on what you're proposing. From my understanding of kprobes (admittedly not an expert), this is hard to do and not a very good match. But you have an idea that it is bad for exposing lots of data. Why? (I'm not a kprobes expert either, these are not rhetorical questions) You could tie your kprobe module to use relay channels. Kprobe handlers run lockless and using the per-cpu relay channels will provide a fast transport mechanism for exposing lots of data. http://relayfs.sourceforge.net/examples.html#tprintk_kprobes is an example using the earlier relayfs interface. It shouldn't be that hard to change it to use the newer relay stuff. AFAIK acme is using a similar mechanism for ctracer (http://oops.ghostprotocols.net:81/blog/?p=50) Ananth - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 0/13] maps: pagemap, kpagemap, and related cleanups
Ananth N Mavinakayanahalli wrote:
> On Fri, Apr 13, 2007 at 12:50:20PM +1000, Nick Piggin wrote:
> > It definitely seems like you can use some kernel functions, but the ones I saw may just be systemtap facilities. But what is so surprising about being able to call a kernel function when running in kernel context? Perhaps there is some fundamental limitation of kprobes that I don't understand.
>
> The main requirement for kprobes handlers is that they can't sleep. You could definitely call a kernel function from kprobe handlers as long as the function doesn't sleep.

That would be enough to access basically all the VM data structures.

-- SUSE Labs, Novell Inc.
Re: [PATCH 0/13] maps: pagemap, kpagemap, and related cleanups
On Fri, Apr 13, 2007 at 01:40:08PM +1000, Nick Piggin wrote: With systemtap scripts, you could walk pagetables and print *the exact page information you want*, or you could walk pfns, or LRU, or page_tree, or walk the page tree then the rmap structures. And you can selectively cull out items you don't care about if you only care about a subset of items, based on arbitrary criteria. And you can most likely do all that more efficiently than with a conglomeration of various /proc files (assuming they even provide what you want in the first place). Yes, but maintaining the systemtap scripts will be a nightmare, since they would be outside the kernel, and as we change our internal data structure, the scripts would become useless. This is a fundamental problem with systemtap that we haven't been able to solve yet, because solving it would freeze various internal data structures or kernel functions. I agree that's not acceptable; which is why I don't think systemtap would be a good match for the problem we're trying to solve here. - Ted - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 0/13] maps: pagemap, kpagemap, and related cleanups
On Thu, Apr 12, 2007 at 05:42:01PM -0700, Andrew Morton wrote:
On Fri, 13 Apr 2007 10:15:24 +1000 Nick Piggin [EMAIL PROTECTED] wrote:

+ ((char *)page)[1] = PAGE_SHIFT;

OK. Shouldn't we just expose page size and endianness by other means? (another file or syscall).

I don't think so - this file exposes fairly deep kernel internals and that's unavoidable, really - it's *supposed* to do that. It is explicitly designed for monitoring kernel behaviour. So it needs special handling by userspace. Keeping the number of files which need such special handling to a minimum will keep the number of applications which are exposed to kernel changes to a minimum.

+ for (; i < 2 * chunk / KPMSIZE; i += 2, pfn++) {
+         ppage = pfn_to_page(pfn);
+         if (!ppage) {
+                 page[i] = 0;
+                 page[i + 1] = 0;
+         } else {
+                 page[i] = ppage->flags;
+                 page[i + 1] = atomic_read(&ppage->_count);
+         }
+ }

Not a good idea to expose raw flags in this manner - it changes at the drop of a hat. We'd need to also expose the kernel's PG_foo-to-bitnumber mapping to make this viable.

I don't think it is viable because that makes the flags part of the userspace ABI.

It *will* be viable. If the application wants to know if a page is dirty, it looks up PG_dirty in /proc/pg_foo-to-bitnumber and uses PG_dirty's numerical offset when inspecting fields in /proc/kpagemap. If correctly designed, such a monitoring application will be able to report upon page flags which we haven't even thought up yet.

We can probably fit this in the existing (variable-sized) header.

I wonder what they are needed for.

Poking deeply into the kernel to provide information about kernel state. There are real-world needs for this, and the people who develop tools to process this information will have decent kernel understanding and will know that the file's contents may alter across kernel versions. It sure beats poking around in /dev/kmem. I doubt if there's a sensible way in which we can prettify this interface without losing information. But we should aim to make it as robust as possible against future kernel changes, of course. And we should satisfy ourselves that all the required information has been made available. The fact that it will satisfy the Oracle requirement is encouraging. Matt, these changes make the new field in /proc/pid/smaps redundant, don't they?

Which new field? From /proc/kpagemap + /proc/*/pagemap, you can basically synthesize any statistic you want, including all the existing ones. For some data, /proc/pid/smaps (or /proc/meminfo) will be considerably more efficient. But in general, most of the statistics in smaps are basically useless for shared mappings, just like RSS. Problem is, we really don't know what statistics we want yet, or even if it can be distilled down to simple numbers anyway.

-- Mathematics is the supreme nostalgia of our time.
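A monitoring tool following the lookup scheme Andrew outlines might look roughly like the sketch below. Everything about the formats here is hypothetical: the thread does not specify the layout of /proc/pg_foo-to-bitnumber, so a "name bitnumber" line format is assumed, the bit numbers in the usage note are invented, and records are assumed to be pairs of little-endian 64-bit words (flags, count), one pair per pfn, with any header already skipped.

```python
import struct

# Assumed record layout: two little-endian 64-bit words (flags, count) per pfn.
RECORD = struct.Struct("<QQ")


def parse_flag_table(text):
    """Parse hypothetical 'PG_name bitnumber' lines into {name: bit}."""
    table = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) == 2:
            table[fields[0]] = int(fields[1])
    return table


def decode_records(buf):
    """Split a raw buffer into a list of (flags, count) tuples."""
    return list(RECORD.iter_unpack(buf))


def pages_with_flag(records, table, name):
    """Return record indexes (relative to the buffer start) with flag set.

    The bit position is looked up by name at run time, so the tool never
    hardcodes a kernel-version-specific bit number.
    """
    bit = table[name]
    return [i for i, (flags, _count) in enumerate(records)
            if (flags >> bit) & 1]
```

For instance, with a table text of "PG_dirty 4" and a buffer whose first record has bit 4 set, `pages_with_flag(records, table, "PG_dirty")` would report index 0. Because the names come from the table, a tool written this way could also list flags it has never heard of.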
Re: [PATCH 0/13] maps: pagemap, kpagemap, and related cleanups
On Fri, 13 Apr 2007 11:24:36 -0500 Matt Mackall [EMAIL PROTECTED] wrote: It *will* be viable. If the application wants to know if a page is dirty, it looks up PG_dirty in /proc/pg_foo-to-bitnumber and uses PG_dirty's numerical offset when inspecting fields in /proc/kpagemap. If correctly designed, such a monitoring application will be able to report upon page flags which we haven't even thought up yet. We can probably fit this in the existing (variable-sized) header. hm, OK.. I wonder what they are needed for. Poking deeply into the kernel to provide information about kernel state. There are real-world needs for this, and the people who develop tools to process this information will have decent kernel understanding and will know that the file's contents may alter across kernel versions. It sure beats poking around in /dev/kmem. I doubt if there's a sensible way in which we can prettify this interface without losing information. But we should aim to make it as robust as possible agaisnt future kenrel changes, of course. And we should satisfy ourselves that all the required information has been made available. The fact that it will satisfy the Oracle requirement is encouraging. Matt, these changes make the new field in /proc/pid/smaps redundant, don't they? Which new field? Referenced: From /proc/kpagemap + /proc/*/pagemap, you can basically synthesize any statistic you want, including all the existing ones. For some data, /proc/pid/smaps (or /proc/meminfo) will be considerably more efficient. You'd need to poke clear_refs beforehand to make the referenced bits useful. Actually, we also need to run around the ptes and collect the pte-referenced bits too. I don't think your code copes with any of that? But in general, most of the statistics in smaps are basically useless for shared mappings, just like RSS. Problem is, we really don't know what statistics we want yet, or even if it can be distilled down to simple numbers anyway. yup. 
But that's the whole point, really: don't prejudge what info userspace is trying to collect. - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 0/13] maps: pagemap, kpagemap, and related cleanups
On Fri, Apr 13, 2007 at 12:18:56PM +1000, Nick Piggin wrote:
> Can't you just traverse arbitrary kernel data structures at a given point in time, exactly like the /proc/ call is doing?

Perhaps. My understanding is that you hook a kprobe to an event. An event is a particular instruction getting executed. Indeed, you can do whatever poking around in the kernel you want at that point. And then you can stuff that data in a buffer that eventually gets to userspace.

This is very different from a read/seek/syscall. Rather than just asking the kernel for some data, we have to wait for the relevant events.

Now, of course, you can make an ugly hack like hooking sys_getpid() and basically make your own system call. Hopefully no one else will call getpid() while you're doing this, etc. Not really how it's intended to work at all, and probably a bitch to use, but possible.

Then the question becomes: why don't we do this for everything else in /proc? And the answer of course is: we put stuff in /proc because it's generally useful. Extra points if it's actually related to 'proc'esses.

Being able to tell what's paged in in a given mapping is useful. Being able to tell what's shared between two mappings is useful. Being able to get an accurate, meaningful picture of how your memory is being used is useful. Heck, I bet some people might find it useful to be able to see what nodes the pages in their process are on. All stuff you shouldn't need to be a kernel hacker to answer.

The flags part of /proc/kpagemap exposes some (very interesting!) implementation details. The rest of it is completely generic to any system with a VM. It's only "deep kernel magic" in the sense that it's not yet exposed.

-- Mathematics is the supreme nostalgia of our time.
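The first two questions above (what is paged in, and what is shared between two mappings) reduce to simple set operations once per-page pfn data is available to userspace. A minimal sketch, assuming each mapping has already been read into a list with one entry per virtual page, holding the pfn or `None` where the page is not present; the list representation and the function names are assumptions made for the example, not the format of any posted patch:

```python
def resident_pfns(pagemap):
    """Set of physical frames backing a mapping (None means not present)."""
    return {pfn for pfn in pagemap if pfn is not None}


def compare_mappings(pagemap_a, pagemap_b):
    """Summarize residency of two mappings and the frames they share."""
    a = resident_pfns(pagemap_a)
    b = resident_pfns(pagemap_b)
    return {
        "resident_a": len(a),
        "resident_b": len(b),
        "shared": sorted(a & b),
    }
```

Neither question needs kernel flags or any other implementation detail, only the virtual-to-physical correspondence.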
Re: [PATCH 0/13] maps: pagemap, kpagemap, and related cleanups
On Fri, Apr 13, 2007 at 10:03:56AM -0700, Andrew Morton wrote: On Fri, 13 Apr 2007 11:24:36 -0500 Matt Mackall [EMAIL PROTECTED] wrote: It *will* be viable. If the application wants to know if a page is dirty, it looks up PG_dirty in /proc/pg_foo-to-bitnumber and uses PG_dirty's numerical offset when inspecting fields in /proc/kpagemap. If correctly designed, such a monitoring application will be able to report upon page flags which we haven't even thought up yet. We can probably fit this in the existing (variable-sized) header. hm, OK.. I wonder what they are needed for. Poking deeply into the kernel to provide information about kernel state. There are real-world needs for this, and the people who develop tools to process this information will have decent kernel understanding and will know that the file's contents may alter across kernel versions. It sure beats poking around in /dev/kmem. I doubt if there's a sensible way in which we can prettify this interface without losing information. But we should aim to make it as robust as possible agaisnt future kenrel changes, of course. And we should satisfy ourselves that all the required information has been made available. The fact that it will satisfy the Oracle requirement is encouraging. Matt, these changes make the new field in /proc/pid/smaps redundant, don't they? Which new field? Referenced: From /proc/kpagemap + /proc/*/pagemap, you can basically synthesize any statistic you want, including all the existing ones. For some data, /proc/pid/smaps (or /proc/meminfo) will be considerably more efficient. You'd need to poke clear_refs beforehand to make the referenced bits useful. Actually, we also need to run around the ptes and collect the pte-referenced bits too. I don't think your code copes with any of that? No, and it probably should. Perhaps dirty as well, though I've kindof lost the plot on how that works lately. 
But in general, most of the statistics in smaps are basically useless for shared mappings, just like RSS. Problem is, we really don't know what statistics we want yet, or even if it can be distilled down to simple numbers anyway. yup. But that's the whole point, really: don't prejudge what info userspace is trying to collect. Right. -- Mathematics is the supreme nostalgia of our time. - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 0/13] maps: pagemap, kpagemap, and related cleanups
On Fri, 13 Apr 2007 12:24:51 -0500 Matt Mackall [EMAIL PROTECTED] wrote:
> > > From /proc/kpagemap + /proc/*/pagemap, you can basically synthesize any statistic you want, including all the existing ones. For some data, /proc/pid/smaps (or /proc/meminfo) will be considerably more efficient.
> > You'd need to poke clear_refs beforehand to make the referenced bits useful. Actually, we also need to run around the ptes and collect the pte-referenced bits too. I don't think your code copes with any of that?
> No, and it probably should. Perhaps dirty as well, though I've kind of lost the plot on how that works lately.

Dirty is OK: the VM keeps pte-dirtiness and page-dirtiness in sync now.
Re: [PATCH 0/13] maps: pagemap, kpagemap, and related cleanups
Christoph Hellwig [EMAIL PROTECTED] writes:
> [...] merge it in the first place? It's very nice to poke deep into the kernel for development purposes. For example for the spu scheduler work I'm doing currently I have a module using kprobes (not the systemtap crap because it's big, bloated, in an odd language, and does a lot of really wrong things in its runtime).

It may be worthwhile to remind people that it is easy to use systemtap only to the extent of automating the placement of kprobes: just to perform the mapping from a function-name/source-file/line-number triplet to a PC. They can use embedded-C code to do all the same stuff they'd do with kprobes. They are not obligated to write any "odd" script code for probing logic, nor indeed use any of this "really wrong" runtime.

> This module allows me to put probes into various places in the scheduler and writes them into a ringbuffer with timestamps allowing me to trace what's going on there. This is really neat. [...]

Indeed, and we too try to make this simple and fast: a couple of lines.

> [...] To summarize, I really love kprobes to ease my debugging work, but using it for any kind of production code is a total nightmare.

But at some point, some interface needs to be fixed for a final user-space tool. Whether that interface fixing is performed by kernel developers being more reluctant to rewrite basic things, or by providing a proc interface, or by maintaining a kprobes module does not matter. Someone will feel constrained, and someone will be liberated.

One neat thing about our systemtap tool is that, no matter what layer such interfaces become fixed within, it can probably interface to them. If there is no fixed interface, it can go down to debugging info. If there are tracing hooks present, it can attach. It can make the disparate standardization policies of different subsystems appear unified.

> > We could distribute some systemtap scripts, and even distribute some basic useful ones like this sort of page info in the kernel source tree.
> We could not really distribute systemtap scripts with the kernel. systemtap is a bloody complicated piece of [software]

I don't know if that should be treated as a compliment to our team, for being able to work quickly on something that a full-grown kernel developer finds "bloody complicated". Perhaps your information is simply outdated. "Big, bloated"? We have several times asked for specifics rather than smears - what about it?

> outside the kernel tree that breaks all the time we change kernel internals. [...]

That's begging the question. If kernel folks are willing to maintain some included systemtap-related code, then by definition it would not "break all the time".

- FChE
Re: [PATCH 0/13] maps: pagemap, kpagemap, and related cleanups
Andrew Morton wrote: On Fri, 13 Apr 2007 12:18:56 +1000 Nick Piggin <[EMAIL PROTECTED]> wrote: I guess one could generate an answer to the static question with systemtap, by accumulating running counts across the application lifetime and then snapshotting them. Sounds hard though. Can't you just traverse arbitrary kernel data structures at a given point in time, exactly like the /proc/ call is doing? Do a full pagetable walk, with all the associated locking from within a systemtap script? I'd be surprised. Maybe if it's mostly hand-coded in C, perhaps. Then you just end up with the same thing, don't you? And my problem isn't with the hardcoded pagetable walker. Yeah, we'd probably still keep the pagetable callback walker thingy with Matt's associated cleanups (and my subsequent ones to clean it up more and move it to mm/): there are other in-kernel users for that anyway. The point is the proc API, and exposing random little parts of deep kernel internals that some people happen to find useful at the time. (which is why we have an incredible proliferation of these things). With systemtap scripts, you could walk pagetables and print *the exact page information you want*, or you could walk pfns, or LRU, or page_tree, or walk the page tree then the rmap structures. And you can selectively cull out items you don't care about if you only care about a subset of items, based on arbitrary criteria. And you can most likely do all that more efficiently than with a conglomeration of various /proc files (assuming they even provide what you want in the first place). -- SUSE Labs, Novell Inc. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 0/13] maps: pagemap, kpagemap, and related cleanups
Nick Piggin wrote: Andrew Morton wrote: Then you just end up with the same thing, don't you? Well _you_ do, because that happens to be exactly what you want. Bill ends up with something that displays page_mapcount instead. And I end up with something that traverses LRU lists rather than pfns. And none of it goes in /proc/ or linux-2.6/. Oh, and you get to change it without recompiling and rebooting your kernel. -- SUSE Labs, Novell Inc. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 0/13] maps: pagemap, kpagemap, and related cleanups
Matt Mackall wrote: On Fri, Apr 13, 2007 at 12:21:25PM +1000, Nick Piggin wrote: Matt Mackall wrote: On Fri, Apr 13, 2007 at 11:42:29AM +1000, Nick Piggin wrote: If kprobes is simply crappy and doesn't work properly for this, then I could accept that. I'm not someone trying to get this info. So why can't it be used? (not just for kpagemap, but for clear_refs and all that gunk too). kprobes is good for looking at events, but bad for looking at state. Especially metric shitloads of state. Why? Why is a kprobes trap significantly more expensive than a read syscall? I guess I'm not clear on what you're proposing. From my understanding of kprobes (admittedly not an expert), this is hard to do and not a very good match. But you have an idea that it is bad for exposing lots of data. Why? (I'm not a kprobes expert either, these are not rhetorical questions) From what it looks like, you can traverse data structures and copy data back to userspace. Which is what makes me think it might be suitable (or could be made suitable). Maybe. How about LRU? Reclaim performance is bad, and you want to work out which pages keep going off the end of it, or which pages keep getting written out via it, or who's pages are on the active list, forcing mine out. Those are actually probably a good match for systemtap as they're all events. Traverse the LRU? Which files do they belong to? What process maps them? -ENOPARSE. Basically, any "stuff" other than what you're exposing. -- SUSE Labs, Novell Inc. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 0/13] maps: pagemap, kpagemap, and related cleanups
Andrew Morton wrote: On Fri, 13 Apr 2007 12:18:56 +1000 Nick Piggin <[EMAIL PROTECTED]> wrote: I guess one could generate an answer to the static question with systemtap, by accumulating running counts across the application lifetime and then snapshotting them. Sounds hard though. Can't you just traverse arbitrary kernel data structures at a given point in time, exactly like the /proc/ call is doing? Do a full pagetable walk, with all the associated locking from within a systemtap script? I'd be surprised. Maybe if it's mostly hand-coded in C, perhaps. It looks like you can traverse arbitrary data structures, yes. It definitely seems like you can use some kernel functions, but the ones I saw may just be systemtap facilities. But what is so surprising about being able to call a kernel function when running in kernel context? Perhaps there is some fundamental limitation of kprobes that I don't understand. Then you just end up with the same thing, don't you? Well _you_ do, because that happens to be exactly what you want. Bill ends up with something that displays page_mapcount instead. And I end up with something that traverses LRU lists rather than pfns. And none of it goes in /proc/ or linux-2.6/. So it isn't really the same thing at all. -- SUSE Labs, Novell Inc. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 0/13] maps: pagemap, kpagemap, and related cleanups
On Fri, Apr 13, 2007 at 12:21:25PM +1000, Nick Piggin wrote: > Matt Mackall wrote: > >On Fri, Apr 13, 2007 at 11:42:29AM +1000, Nick Piggin wrote: > > >>If kprobes is simply crappy and doesn't work properly for this, then I > >>could accept that. I'm not someone trying to get this info. So why can't > >>it be used? (not just for kpagemap, but for clear_refs and all that gunk > >>too). > > > > > >kprobes is good for looking at events, but bad for looking at state. > >Especially metric shitloads of state. > > Why? Why is a kprobes trap significantly more expensive than a read > syscall? I guess I'm not clear on what you're proposing. From my understanding of kprobes (admittedly not an expert), this is hard to do and not a very good match. > >>Maybe. How about LRU? Reclaim performance is bad, and you want to work out > >>which pages keep going off the end of it, or which pages keep getting > >>written out via it, or who's pages are on the active list, forcing mine > >>out. > > > > > >Those are actually probably a good match for systemtap as they're all > >events. > > Traverse the LRU? Which files do they belong to? What process maps them? -ENOPARSE. -- Mathematics is the supreme nostalgia of our time. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 0/13] maps: pagemap, kpagemap, and related cleanups
On Fri, 13 Apr 2007 12:18:56 +1000 Nick Piggin <[EMAIL PROTECTED]> wrote: > > I guess one could generate an answer to the static question with systemtap, > > by accumulating running counts across the application lifetime and then > > snapshotting them. Sounds hard though. > > Can't you just traverse arbitrary kernel data structures at a given point > in time, exactly like the /proc/ call is doing? Do a full pagetable walk, with all the associated locking from within a systemtap script? I'd be surprised. Maybe if it's mostly hand-coded in C, perhaps. Then you just end up with the same thing, don't you? - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 0/13] maps: pagemap, kpagemap, and related cleanups
Matt Mackall wrote: On Thu, Apr 12, 2007 at 06:57:23PM -0700, Andrew Morton wrote: I guess one could generate an answer to the static question with systemtap, by accumulating running counts across the application lifetime and then snapshotting them. Sounds hard though. You'd have to do it from boot onward to get a complete system image. One way to look at it is that systemtap can give you the derivative of the information, and you have to integrate it. So everyone keeps saying. Would you tell me why you can't just traverse the data structures in the same way as your proc handler? From the systemtap example scripts it seems like you can traverse arbitrary kernel data structures. -- SUSE Labs, Novell Inc. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 0/13] maps: pagemap, kpagemap, and related cleanups
Matt Mackall wrote: On Fri, Apr 13, 2007 at 11:42:29AM +1000, Nick Piggin wrote: If kprobes is simply crappy and doesn't work properly for this, then I could accept that. I'm not someone trying to get this info. So why can't it be used? (not just for kpagemap, but for clear_refs and all that gunk too). kprobes is good for looking at events, but bad for looking at state. Especially metric shitloads of state. Why? Why is a kprobes trap significantly more expensive than a read syscall? Maybe. How about LRU? Reclaim performance is bad, and you want to work out which pages keep going off the end of it, or which pages keep getting written out via it, or who's pages are on the active list, forcing mine out. Those are actually probably a good match for systemtap as they're all events. Traverse the LRU? Which files do they belong to? What process maps them? -- SUSE Labs, Novell Inc. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 0/13] maps: pagemap, kpagemap, and related cleanups
Andrew Morton wrote: On Fri, 13 Apr 2007 11:42:29 +1000 Nick Piggin <[EMAIL PROTECTED]> wrote: Maybe. How about LRU? Reclaim performance is bad, and you want to work out which pages keep going off the end of it, or which pages keep getting written out via it, or who's pages are on the active list, forcing mine out. I guess we have static analysis versus dynamic. The interfaces which Matt is proposing are suited to answering the question "what is my memory being used for" (static). They're unlikely to be useful for answering the question "what's happening in the VM" (dynamic). Systemtap is probably better for the dynamic analysis. "what is my memory being used for *now*" ;) I guess one could generate an answer to the static question with systemtap, by accumulating running counts across the application lifetime and then snapshotting them. Sounds hard though. Can't you just traverse arbitrary kernel data structures at a given point in time, exactly like the /proc/ call is doing? -- SUSE Labs, Novell Inc. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 0/13] maps: pagemap, kpagemap, and related cleanups
On Thu, Apr 12, 2007 at 06:57:23PM -0700, Andrew Morton wrote: > I guess one could generate an answer to the static question with systemtap, > by accumulating running counts across the application lifetime and then > snapshotting them. Sounds hard though. You'd have to do it from boot onward to get a complete system image. One way to look at it is that systemtap can give you the derivative of the information, and you have to integrate it. -- Mathematics is the supreme nostalgia of our time. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
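The "derivative versus state" point above can be sketched in a few lines of user-space C. This is only an illustration under stated assumptions: the event encoding (+1 for a page allocation, -1 for a free) and the helper name integrate_events() are made up for this sketch and come from neither systemtap nor the patch set. Summing events only reproduces the true page count if the sum starts from a known baseline, which is why a tracer attached after boot cannot recover absolute state by itself.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Integrate a stream of events into a running count.
 * Hypothetical encoding for this sketch: +1 = page allocated,
 * -1 = page freed. `baseline` is the count that existed before
 * the first traced event; a tracer started late does not know it.
 */
long integrate_events(const int *events, size_t n, long baseline)
{
    long total = baseline;
    size_t i;

    assert(events != NULL || n == 0);

    for (i = 0; i < n; i++)
        total += events[i];

    return total;
}
```

For example, if two pages were allocated before tracing began, a trace that then sees one free and one allocation integrates to 0 with a zero baseline, although 2 pages are in use; passing a baseline of 2 gives the right answer.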
Re: [PATCH 0/13] maps: pagemap, kpagemap, and related cleanups
Matt Mackall wrote: On Fri, Apr 13, 2007 at 11:01:41AM +1000, Nick Piggin wrote: Basically: to show what the hell's going on in the VM. kprobes / systemtap isn't good enough? It's not really a good match to the kprobes model. I'm not interested in events, per se. I don't want to need to know about every single alloc/free of N different varieties integrated from boot onward to build up an image of the state of the system. Instead, I want to take snapshots of the state of the VM. Systemtap can't output a large set of values? Why can't you attach a kprobe to a dummy syscall, and from there iterate over pgdat/zone/memmap and output what you want? Actually I'm surprised that kind of data querying facility isn't already in there (I haven't used it seriously though). The main goal here is to be able to answer the question "where's my memory going?". Currently you can't really give a good answer to that question from userspace because of shared mappings, etc. There are lots of secondary questions that follow on very quickly from that, like "what parts of my shared mappings are or aren't shared, and why?", "what's actually in my application's working set?" and "how much of this crap can I ditch?". I understand roughly what you want, and that you can't easily get it from /proc currently. My question at this point is just why can we not use systemtap. -- SUSE Labs, Novell Inc. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 0/13] maps: pagemap, kpagemap, and related cleanups
On Fri, Apr 13, 2007 at 11:42:29AM +1000, Nick Piggin wrote: > >Instead, one says "what pages are being used by my application", then, for > > That includes unmapped pagecache being used by my application, doesn't it? > Maybe that's too hard to do via /proc so we forget about it... It'd be really nice to have a window into the pagecache too. But I for one couldn't come up with a sensible scheme for it. > >each of those pages "what is that page's state". So the first step is to > >collect all the pfns from /proc/$(pidof my-application)/pagemap and then to > >use those pfns to look the individual pages up in /proc/kpagemap. > > OK I realise you could do it that way, but systemtap can definitely be > used as a tool for understanding application behaviour in the context of > the kernel, I think? The purpose for it is so that various little bits > of deep kernel internals do not have to be exposed on a case by case basis. > > If kprobes is simply crappy and doesn't work properly for this, then I > could accept that. I'm not someone trying to get this info. So why can't > it be used? (not just for kpagemap, but for clear_refs and all that gunk > too). kprobes is good for looking at events, but bad for looking at state. Especially metric shitloads of state. > > If you really want to know "who is using page 123435" then you'd need to > > search /proc/*/pagemap. There are possibly legitimate reasons why an > > application developer would want to at least partially perform such an > > operation ("who am I sharing with"), but I doubt if it's the common case. > > Maybe. How about LRU? Reclaim performance is bad, and you want to work out > which pages keep going off the end of it, or which pages keep getting > written out via it, or whose pages are on the active list, forcing mine > out. Those are actually probably a good match for systemtap as they're all events. -- Mathematics is the supreme nostalgia of our time.
Re: [PATCH 0/13] maps: pagemap, kpagemap, and related cleanups
On Fri, 13 Apr 2007 11:42:29 +1000 Nick Piggin <[EMAIL PROTECTED]> wrote: > Andrew Morton wrote: > > On Fri, 13 Apr 2007 11:14:20 +1000 Nick Piggin <[EMAIL PROTECTED]> wrote: > > > > > >>Andrew Morton wrote: > > >>>It *will* be viable. If the application wants to know if a page is dirty, > >>>it looks up "PG_dirty" in /proc/pg_foo-to-bitnumber and uses PG_dirty's > >>>numerical offset when inspecting fields in /proc/kpagemap. If correctly > >>>designed, such a monitoring application will be able to report upon page > >>>flags which we haven't even thought up yet. > >> > >>Ooh, you wanted a _runtime_ mapping of flags, yeah then I guess that works. > >>Still seems like a basically hit and miss affair to just use flags. What if > >>you want to know the process mapping a page? With systemtap or something you > >>could walk the rmap structures. What if you want to look at pages along the > >>LRU list rather than per-pfn? What about connecting pages to inodes? > > > > > > Well hang on. This isn't a tool for understanding kernel behaviour. It's > > a tool for understanding application behaviour. > > > > So one doesn't ask "who is mapping that page" - that's a kernel developer > > thing. > > > > Instead, one says "what pages are being used by my application", then, for > > That includes unmapped pagecache being used by my application, doesn't it? > Maybe that's too hard to do via /proc so we forget about it... Yes, harder. I'm hoping that sampling of /proc/pid/io can be used to determine pagecache use sufficiently accurately. I know of one large hosting company who are using it ("BTW, we are making great use of taskstats!! Its GREAT!") > > > each of those pages "what is that page's state". So the first step is to > > collect all the pfns from /proc/$(pidof my-application)/pagemap and then to > > use those pfns to look the individual pages up in /proc/kpagemap.
> > OK I realise you could do it that way, but systemtap can definitely be > used as a tool for understanding application behaviour in the context of > the kernel, I think? The purpose for it is so that various little bits > of deep kernel internals do not have to be exposed on a case by case basis. > > If kprobes is simply crappy and doesn't work properly for this, then I > could accept that. I'm not someone trying to get this info. So why can't > it be used? (not just for kpagemap, but for clear_refs and all that gunk > too). > > > If you really want to know "who is using page 123435" then you'd need to > > search /proc/*/pagemap. There are possibly legitimate reasons why an > > application developer would want to at least partially perform such an > > operation ("who am I sharing with"), but I doubt if it's the common case. > > Maybe. How about LRU? Reclaim performance is bad, and you want to work out > which pages keep going off the end of it, or which pages keep getting > written out via it, or whose pages are on the active list, forcing mine > out. I guess we have static analysis versus dynamic. The interfaces which Matt is proposing are suited to answering the question "what is my memory being used for" (static). They're unlikely to be useful for answering the question "what's happening in the VM" (dynamic). Systemtap is probably better for the dynamic analysis. I guess one could generate an answer to the static question with systemtap, by accumulating running counts across the application lifetime and then snapshotting them. Sounds hard though.
Re: [PATCH 0/13] maps: pagemap, kpagemap, and related cleanups
On Fri, Apr 13, 2007 at 11:01:41AM +1000, Nick Piggin wrote: > >Basically: to show what the hell's going on in the VM. > > kprobes / systemtap isn't good enough? It's not really a good match to the kprobes model. I'm not interested in events, per se. I don't want to need to know about every single alloc/free of N different varieties integrated from boot onward to build up an image of the state of the system. Instead, I want to take snapshots of the state of the VM. The main goal here is to be able to answer the question "where's my memory going?". Currently you can't really give a good answer to that question from userspace because of shared mappings, etc. There are lots of secondary questions that follow on very quickly from that, like "what parts of my shared mappings are or aren't shared, and why?", "what's actually in my application's working set?" and "how much of this crap can I ditch?". -- Mathematics is the supreme nostalgia of our time. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 0/13] maps: pagemap, kpagemap, and related cleanups
Andrew Morton wrote: On Fri, 13 Apr 2007 11:14:20 +1000 Nick Piggin <[EMAIL PROTECTED]> wrote: Andrew Morton wrote: It *will* be viable. If the application wants to know if a page is dirty, it looks up "PG_dirty" in /proc/pg_foo-to-bitnumber and uses PG_dirty's numerical offset when inspecting fields in /proc/kpagemap. If correctly designed, such a monitoring application will be able to report upon page flags which we haven't even thought up yet. Ooh, you wanted a _runtime_ mapping of flags, yeah then I guess that works. Still seems like a basically hit and miss affair to just use flags. What if you want to know the process mapping a page? With systemtap or something you could walk the rmap structures. What if you want to look at pages along the LRU list rather than per-pfn? What about connecting pages to inodes? Well hang on. This isn't a tool for understanding kernel behaviour. It's a tool for understanding application behaviour. So one doesn't ask "who is mapping that page" - that's a kernel developer thing. Instead, one says "what pages are being used by my application", then, for That includes unmapped pagecache being used by my application, doesn't it? Maybe that's too hard to do via /proc so we forget about it... each of those pages "what is that page's state". So the first step is to collect all the pfns from /proc/$(pidof my-application)/pagemap and then to use those pfns to look the individual pages up in /proc/kpagemap. OK I realise you could do it that way, but systemtap can definitely be used as a tool for understanding application behaviour in the context of the kernel, I think? The purpose for it is so that various little bits of deep kernel internals do not have to be exposed on a case by case basis. If kprobes is simply crappy and doesn't work properly for this, then I could accept that. I'm not someone trying to get this info. So why can't it be used? (not just for kpagemap, but for clear_refs and all that gunk too).
> If you really want to know "who is using page 123435" then you'd need to > search /proc/*/pagemap. There are possibly legitimate reasons why an > application developer would want to at least partially perform such an > operation ("who am I sharing with"), but I doubt if it's the common case. Maybe. How about LRU? Reclaim performance is bad, and you want to work out which pages keep going off the end of it, or which pages keep getting written out via it, or whose pages are on the active list, forcing mine out. -- SUSE Labs, Novell Inc.
Re: [PATCH 0/13] maps: pagemap, kpagemap, and related cleanups
On Fri, 13 Apr 2007 11:14:20 +1000 Nick Piggin <[EMAIL PROTECTED]> wrote: > Andrew Morton wrote: > > On Fri, 13 Apr 2007 10:15:24 +1000 Nick Piggin <[EMAIL PROTECTED]> wrote: > > + for (; i < 2 * chunk / KPMSIZE; i += 2, pfn++) { > + ppage = pfn_to_page(pfn); > + if (!ppage) { > + page[i] = 0; > + page[i + 1] = 0; > + } else { > + page[i] = ppage->flags; > + page[i + 1] = atomic_read(&ppage->_count); > + } > + } > >>> > >>> > >>>Not a good idea to expose raw flags in this manner - it changes at the drop > >>>of a hat. We'd need to also expose the kernel's PG_foo-to-bitnumber > >>>mapping to make this viable. > >> > >>I don't think it is viable because that makes the flags part of the > >>userspace ABI. > > > > > > It *will* be viable. If the application wants to know if a page is dirty, > > it looks up "PG_dirty" in /proc/pg_foo-to-bitnumber and uses PG_dirty's > > numerical offset when inspecting fields in /proc/kpagemap. If correctly > > designed, such a monitoring application will be able to report upon page > > flags which we haven't even thought up yet. > > Ooh, you wanted a _runtime_ mapping of flags, yeah then I guess that works. > Still seems like a basically hit and miss affair to just use flags. What if > you want to know the process mapping a page? With systemtap or something you > could walk the rmap structures. What if you want to look at pages along the > LRU list rather than per-pfn? What about connecting pages to inodes? Well hang on. This isn't a tool for understanding kernel behaviour. It's a tool for understanding application behaviour. So one doesn't ask "who is mapping that page" - that's a kernel developer thing. Instead, one says "what pages are being used by my application", then, for each of those pages "what is that page's state". So the first step is to collect all the pfns from /proc/$(pidof my-application)/pagemap and then to use those pfns to look the individual pages up in /proc/kpagemap.
If you really want to know "who is using page 123435" then you'd need to search /proc/*/pagemap. There are possibly legitimate reasons why an application developer would want to at least partially perform such an operation ("who am I sharing with"), but I doubt if it's the common case. > > But I was going to say > that satisfying an Oracle requirement is a good reason _not_ to merge it ;) > hm, yes, there's plenty of precedent for that. > (I joke!) I akpm!
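The two-step lookup described above (collect pfns from the per-process pagemap, then look each one up in kpagemap) might look roughly like this in user-space C. This is a sketch under assumptions, not the patch's interface: it works on an in-memory copy of the kpagemap contents, it assumes each record is two machine words (flags, then count) as the quoted loop suggests, and it assumes one header record sits in front of the record for pfn 0. The function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed layout: each record is two words, flags then count. */
#define WORDS_PER_RECORD 2

/* Byte offset of the record for `pfn`; +1 skips the assumed header record. */
size_t kpagemap_record_offset(unsigned long pfn, size_t word_size)
{
    return ((size_t)pfn + 1) * WORDS_PER_RECORD * word_size;
}

/*
 * Step 2 of the lookup: for every pfn gathered in step 1, read the flags
 * word from the dump and count the pages that have bit `bit` set.
 * pfns whose record lies outside the dump are skipped.
 */
size_t count_pages_with_flag(const unsigned long *dump, size_t dump_words,
                             const unsigned long *pfns, size_t npfns, int bit)
{
    size_t i, hits = 0;

    assert(dump != NULL && pfns != NULL);
    assert(bit >= 0 && bit < (int)(sizeof(unsigned long) * 8));

    for (i = 0; i < npfns; i++) {
        size_t idx = ((size_t)pfns[i] + 1) * WORDS_PER_RECORD;

        if (idx + 1 >= dump_words)
            continue;               /* record not present in this dump */
        if ((dump[idx] >> bit) & 1UL)
            hits++;
    }
    return hits;
}
```

A real tool would read the pfn list from /proc/<pid>/pagemap and seek to kpagemap_record_offset() in /proc/kpagemap instead of indexing an array; the arithmetic would stay the same if the layout assumptions hold.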
Re: [PATCH 0/13] maps: pagemap, kpagemap, and related cleanups
Andrew Morton wrote: On Fri, 13 Apr 2007 10:15:24 +1000 Nick Piggin <[EMAIL PROTECTED]> wrote: + for (; i < 2 * chunk / KPMSIZE; i += 2, pfn++) { + ppage = pfn_to_page(pfn); + if (!ppage) { + page[i] = 0; + page[i + 1] = 0; + } else { + page[i] = ppage->flags; + page[i + 1] = atomic_read(&ppage->_count); + } + } Not a good idea to expose raw flags in this manner - it changes at the drop of a hat. We'd need to also expose the kernel's PG_foo-to-bitnumber mapping to make this viable. I don't think it is viable because that makes the flags part of the userspace ABI. It *will* be viable. If the application wants to know if a page is dirty, it looks up "PG_dirty" in /proc/pg_foo-to-bitnumber and uses PG_dirty's numerical offset when inspecting fields in /proc/kpagemap. If correctly designed, such a monitoring application will be able to report upon page flags which we haven't even thought up yet. Ooh, you wanted a _runtime_ mapping of flags, yeah then I guess that works. Still seems like a basically hit and miss affair to just use flags. What if you want to know the process mapping a page? With systemtap or something you could walk the rmap structures. What if you want to look at pages along the LRU list rather than per-pfn? What about connecting pages to inodes? I thought this type of deep poking was the whole reason the probes thingies were merged. I'm saddened that they're no good for this. I thought it would be an ideal usage :( I wonder what they are needed for. Poking deeply into the kernel to provide information about kernel state. There are real-world needs for this, and the people who develop tools to process this information will have decent kernel understanding and will know that the file's contents may alter across kernel versions. It sure beats poking around in /dev/kmem. I doubt if there's a sensible way in which we can prettify this interface without losing information. But we should aim to make it as robust as possible against future kernel changes, of course.
And we should satisfy ourselves that all the required information has been made available. The fact that it will satisfy the Oracle requirement is encouraging. Yeah it is close, they need page_mapcount I think. But I was going to say that satisfying an Oracle requirement is a good reason _not_ to merge it ;) (I joke!) -- SUSE Labs, Novell Inc. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 0/13] maps: pagemap, kpagemap, and related cleanups
Matt Mackall wrote: On Fri, Apr 13, 2007 at 10:15:24AM +1000, Nick Piggin wrote: Andrew Morton wrote: On Thu, 12 Apr 2007 16:10:50 -0700 William Lee Irwin III <[EMAIL PROTECTED]> wrote: + while (count > 0) { + chunk = min_t(size_t, count, PAGE_SIZE); + i = 0; + + if (pfn == -1) { + page[0] = 0; + page[1] = 0; + ((char *)page)[0] = (ntohl(1) != 1); OK. + ((char *)page)[1] = PAGE_SHIFT; OK. Shouldn't we just expose page size and endianness by other means? (another file or syscall). If I send you this file dumped from a random machine, you won't know what to make of it. That's a good reason ;) I'm planning to write a trivial server to sit on, say, my embedded target and spew this over the wire to a client. Not a good idea to expose raw flags in this manner - it changes at the drop of a hat. We'd need to also expose the kernel's PG_foo-to-bitnumber mapping to make this viable. I don't think it is viable because that makes the flags part of the userspace ABI. I wonder what they are needed for. Basically: to show what the hell's going on in the VM. kprobes / systemtap isn't good enough? -- SUSE Labs, Novell Inc. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 0/13] maps: pagemap, kpagemap, and related cleanups
On Fri, 13 Apr 2007 10:15:24 +1000 Nick Piggin <[EMAIL PROTECTED]> wrote: > >>+ ((char *)page)[1] = PAGE_SHIFT; > > > > > > OK. > > Shouldn't we just expose page size and endianness by other means? (another > file or > syscall). I don't think so - this file exposes fairly deep kernel internals and that's unavoidable, really - it's *supposed* to do that. It is explicitly designed for monitoring kernel behaviour. So it needs special handling by userspace. Keeping the number of files which need such special handling to a minimum will keep the number of applications which are exposed to kernel changes to a minimum. > >>+ for (; i < 2 * chunk / KPMSIZE; i += 2, pfn++) { > >>+ ppage = pfn_to_page(pfn); > >>+ if (!ppage) { > >>+ page[i] = 0; > >>+ page[i + 1] = 0; > >>+ } else { > >>+ page[i] = ppage->flags; > >>+ page[i + 1] = atomic_read(&ppage->_count); > >>+ } > >>+ } > > > > > > Not a good idea to expose raw flags in this manner - it changes at the drop > > of a hat. We'd need to also expose the kernel's PG_foo-to-bitnumber > > mapping to make this viable. > > I don't think it is viable because that makes the flags part of the > userspace ABI. It *will* be viable. If the application wants to know if a page is dirty, it looks up "PG_dirty" in /proc/pg_foo-to-bitnumber and uses PG_dirty's numerical offset when inspecting fields in /proc/kpagemap. If correctly designed, such a monitoring application will be able to report upon page flags which we haven't even thought up yet. > I wonder what they are needed for. Poking deeply into the kernel to provide information about kernel state. There are real-world needs for this, and the people who develop tools to process this information will have decent kernel understanding and will know that the file's contents may alter across kernel versions. It sure beats poking around in /dev/kmem. I doubt if there's a sensible way in which we can prettify this interface without losing information.
But we should aim to make it as robust as possible agaisnt future kenrel changes, of course. And we should satisfy ourselves that all the required information has been made available. The fact that it will satisfy the Oracle requirement is encouraging. Matt, these changes make the new field in /proc/pid/smaps redundant, don't they? - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
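A rough sketch of the runtime lookup described above, for anyone who wants to see it spelled out. It assumes the mapping file is plain text with one "NAME BITNUMBER" pair per line; the thread does not define a format, and /proc/pg_foo-to-bitnumber is itself only a placeholder name. The helpers pg_flag_bit() and pg_flag_set() are hypothetical names, not part of any posted patch.

```c
#include <limits.h>
#include <stdio.h>
#include <string.h>

/*
 * Look up a flag name in a text table of "NAME BITNUMBER" lines.
 * The table format is an assumption; the thread does not specify one.
 * Returns the bit number, or -1 if the name is not listed.
 */
static int pg_flag_bit(const char *table, const char *name)
{
	char key[64];
	int bit, used;

	/* %n records how many characters this entry consumed. */
	while (sscanf(table, " %63s %d%n", key, &bit, &used) == 2) {
		if (strcmp(key, name) == 0)
			return bit;
		table += used;
	}
	return -1;
}

/* Test one bit in a raw flags word taken from a kpagemap entry. */
static int pg_flag_set(unsigned long flags, int bit)
{
	if (bit < 0 || bit >= (int)(sizeof(flags) * CHAR_BIT))
		return 0;	/* unknown or out-of-range flag */
	return (int)((flags >> bit) & 1UL);
}
```

Because the bit number is resolved by name at run time, a tool written this way never hard-codes a PG_* value, which is the property the flags-as-ABI objection is worried about.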
Re: [PATCH 0/13] maps: pagemap, kpagemap, and related cleanups
On Fri, Apr 13, 2007 at 10:15:24AM +1000, Nick Piggin wrote:
> Andrew Morton wrote:
> >On Thu, 12 Apr 2007 16:10:50 -0700
> >William Lee Irwin III <[EMAIL PROTECTED]> wrote:
>
> >>+	while (count > 0) {
> >>+		chunk = min_t(size_t, count, PAGE_SIZE);
> >>+		i = 0;
> >>+
> >>+		if (pfn == -1) {
> >>+			page[0] = 0;
> >>+			page[1] = 0;
> >>+			((char *)page)[0] = (ntohl(1) != 1);
> >
> >OK.
> >
> >>+			((char *)page)[1] = PAGE_SHIFT;
> >
> >OK.
>
> Shouldn't we just expose page size and endianness by other means? (another
> file or syscall).

If I send you this file dumped from a random machine, you won't know
what to make of it. I'm planning to write a trivial server to sit on,
say, my embedded target and spew this over the wire to a client.

> >Not a good idea to expose raw flags in this manner - it changes at the drop
> >of a hat. We'd need to also expose the kernel's PG_foo-to-bitnumber
> >mapping to make this viable.
>
> I don't think it is viable because that makes the flags part of the
> userspace ABI. I wonder what they are needed for.

Basically: to show what the hell's going on in the VM.

-- 
Mathematics is the supreme nostalgia of our time.
Re: [PATCH 0/13] maps: pagemap, kpagemap, and related cleanups
On Thu, Apr 12, 2007 at 04:32:35PM -0700, Andrew Morton wrote:
> On Thu, 12 Apr 2007 16:10:50 -0700
> William Lee Irwin III <[EMAIL PROTECTED]> wrote:
>
> > On Tue, Apr 03, 2007 at 09:43:30PM -0500, Matt Mackall wrote:
> > > This patch series introduces /proc/pid/pagemap and /proc/kpagemap,
> > > which allow detailed run-time examination of process memory usage at a
> > > page granularity.
> > > The first several patches whip the page-walking code introduced for
> > > /proc/pid/smaps and clear_refs into a more generic form, the next
> > > couple make those interfaces optional, and the last two introduce the
> > > new interfaces, also optional.
> >
> > This solves a real-life problem for Oracle system monitoring software
> > (specifically EM). Among the tasks it must carry out is determining
> > per-process memory footprint of a set of cooperating tasks (i.e. Oracle
> > processes). RSS is inadequate for this due to page sharing; this work
> > provides sufficient information to determine what EM needs.
>
> I'm still dying to see what the human-readable output from this
> thing looks like.

Still a work-in-progress. It's a monstrous amount of data and it
basically requires a GUI to really get a handle on. Here's a couple
apps I've been tinkering with (aka My First GTK Apps):

http://selenic.com/Screenshot-pagemap.png

That's a snapshot of a live-updating image of memory usage for a
running process (Galeon). Each pixel is a page. Each 32x32 block is
4MB. Mappings are dark red. Pages that are actually faulted in are
bright red. You can poke around in the memory map with the mouse and
highlight mappings (blue). And pages that get faulted in flash green
(hard to capture in a screenshot).

http://selenic.com/Screenshot-kpagemap.png

And that's a live-updating image of system-wide memory usage. Bright
red are pages with a count of 1, dark red are pages with higher
counts. Next is to visualize slab/page cache/buddy/active/lru data as
well as highlight changing pages.

This isn't terribly interesting yet. It can tell you things about page
cache usage and fragmentation and readahead and so on. But correlating
across the two sources, we'll be able to show information like "what
pages in a process are actually shared/active/lru/etc." You can take
it even further by correlating the above data with symbol info from
nm, /proc/pid/clear_refs, etc.

Also, something I immediately noticed on looking at the raw data
(cat /proc/`pidof`/pagemap | hexdump -C | less):

002c8fd0  ff ff ff ff ff ff ff ff  ff ff ff ff 6d f8 03 00  |............m...|
002c8fe0  6c f8 03 00 b9 f8 03 00  6b f8 03 00 6a f8 03 00  |l.......k...j...|
002c8ff0  b8 f8 03 00 69 f8 03 00  68 f8 03 00 b7 f8 03 00  |....i...h.......|
002c9000  67 f8 03 00 66 f8 03 00  b6 f8 03 00 65 f8 03 00  |g...f.......e...|
002c9010  64 f8 03 00 b5 f8 03 00  63 f8 03 00 62 f8 03 00  |d.......c...b...|
002c9020  b4 f8 03 00 61 f8 03 00  60 f8 03 00 b3 f8 03 00  |....a...`.......|
002c9030  7f f8 03 00 7e f8 03 00  b2 f8 03 00 7d f8 03 00  |....~.......}...|
002c9040  7c f8 03 00 b1 f8 03 00  5f f8 03 00 5e f8 03 00  ||......._...^...|
002c9050  b0 f8 03 00 5d f8 03 00  5c f8 03 00 af f8 03 00  |....]...\.......|

Most of the consecutive page frames are allocated in descending order
(6d 6c 6b 6a ...). That's pessimal for physical merging of block I/O.
Given that we theoretically fixed this long-standing problem in 2.6
but it's obviously still happening, it's clear that a little more
visibility into the VM would be useful.

-- 
Mathematics is the supreme nostalgia of our time.
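The descending-order observation can be checked mechanically rather than by eye. The following is only a sketch: it assumes 4-byte little-endian entries, as in the dump above, and it treats 0xffffffff as "no frame present", which is a guess from the dump and is not stated in the thread. The function names le32() and count_descending() are made up for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Decode one 4-byte little-endian pagemap entry. */
static uint32_t le32(const unsigned char *p)
{
	return (uint32_t)p[0] | (uint32_t)p[1] << 8 |
	       (uint32_t)p[2] << 16 | (uint32_t)p[3] << 24;
}

/*
 * Count adjacent entry pairs where the second pfn is exactly one less
 * than the first. Entries equal to 0xffffffff are skipped, on the
 * assumption that they mark pages with no frame behind them.
 */
static size_t count_descending(const unsigned char *buf, size_t n_entries)
{
	size_t i, hits = 0;

	for (i = 1; i < n_entries; i++) {
		uint32_t prev = le32(buf + 4 * (i - 1));
		uint32_t cur = le32(buf + 4 * i);

		if (prev == 0xffffffffu || cur == 0xffffffffu)
			continue;
		if (cur + 1 == prev)
			hits++;
	}
	return hits;
}
```

Comparing the result against the number of populated pairs gives a crude figure for how much of a mapping is laid out in reverse physical order.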
Re: [PATCH 0/13] maps: pagemap, kpagemap, and related cleanups
William Lee Irwin III wrote:
> On Thu, 12 Apr 2007 16:10:50 -0700 William Lee Irwin III <[EMAIL PROTECTED]> wrote:
> >> This solves a real-life problem for Oracle system monitoring software
> >> (specifically EM). Among the tasks it must carry out is determining
> >> per-process memory footprint of a set of cooperating tasks (i.e. Oracle
> >> processes). RSS is inadequate for this due to page sharing; this work
> >> provides sufficient information to determine what EM needs.
>
> On Thu, Apr 12, 2007 at 04:32:35PM -0700, Andrew Morton wrote:
> > Not a good idea to expose raw flags in this manner - it changes at the drop
> > of a hat. We'd need to also expose the kernel's PG_foo-to-bitnumber
> > mapping to make this viable.
> > Not a good idea to use page->_count: page_count() will be more stable.
> > Otherwise OK, I guess: the interpretation of the page refcount is unlikely
> > to change much over time.
>
> EM wants to determine page_mapcount() for the most part for the purposes
> of determining "uniquely attributable RSS" (my ca. 2004 nomenclature) or
> "proportional RSS" (mpm's more recent nomenclature); as things now stand
> it will have to infer them by maintaining a table of pfn's and mappings
> thereof, but at least that can be done with it.

I don't know whether you can easily determine page_mapcount with page_count
and flags, though (count gives you an educated guess, but mapcount is the
real thing).

page_mapcount sounds very reasonable to export. It is directly tied with the
userspace concept of mapping pages. page_count doesn't seem very useful (and
if you must have it, please use page_count), neither does page flags. You
could have a bit indicating whether the page is free or not (but that doesn't
tell you much that meminfo or zoneinfo or buddyinfo does not).

Dirty/writeback/referenced/uptodate maybe?... I'm stumped, what's flags for?

-- 
SUSE Labs, Novell Inc.
Re: [PATCH 0/13] maps: pagemap, kpagemap, and related cleanups
Andrew Morton wrote:
>On Thu, 12 Apr 2007 16:10:50 -0700
>William Lee Irwin III <[EMAIL PROTECTED]> wrote:

>>+	while (count > 0) {
>>+		chunk = min_t(size_t, count, PAGE_SIZE);
>>+		i = 0;
>>+
>>+		if (pfn == -1) {
>>+			page[0] = 0;
>>+			page[1] = 0;
>>+			((char *)page)[0] = (ntohl(1) != 1);
>
>OK.
>
>>+			((char *)page)[1] = PAGE_SHIFT;
>
>OK.

Shouldn't we just expose page size and endianness by other means? (another
file or syscall).

>>+		for (; i < 2 * chunk / KPMSIZE; i += 2, pfn++) {
>>+			ppage = pfn_to_page(pfn);
>>+			if (!ppage) {
>>+				page[i] = 0;
>>+				page[i + 1] = 0;
>>+			} else {
>>+				page[i] = ppage->flags;
>>+				page[i + 1] = atomic_read(&ppage->_count);
>>+			}
>>+		}
>
>Not a good idea to expose raw flags in this manner - it changes at the drop
>of a hat. We'd need to also expose the kernel's PG_foo-to-bitnumber
>mapping to make this viable.

I don't think it is viable because that makes the flags part of the
userspace ABI. I wonder what they are needed for.

-- 
SUSE Labs, Novell Inc.
Re: [PATCH 0/13] maps: pagemap, kpagemap, and related cleanups
On Thu, 12 Apr 2007 16:10:50 -0700 William Lee Irwin III <[EMAIL PROTECTED]> wrote:
>> This solves a real-life problem for Oracle system monitoring software
>> (specifically EM). Among the tasks it must carry out is determining
>> per-process memory footprint of a set of cooperating tasks (i.e. Oracle
>> processes). RSS is inadequate for this due to page sharing; this work
>> provides sufficient information to determine what EM needs.

On Thu, Apr 12, 2007 at 04:32:35PM -0700, Andrew Morton wrote:
> Not a good idea to expose raw flags in this manner - it changes at the drop
> of a hat. We'd need to also expose the kernel's PG_foo-to-bitnumber
> mapping to make this viable.
> Not a good idea to use page->_count: page_count() will be more stable.
> Otherwise OK, I guess: the interpretation of the page refcount is unlikely
> to change much over time.

EM wants to determine page_mapcount() for the most part for the purposes
of determining "uniquely attributable RSS" (my ca. 2004 nomenclature) or
"proportional RSS" (mpm's more recent nomenclature); as things now stand
it will have to infer them by maintaining a table of pfn's and mappings
thereof, but at least that can be done with it.

-- wli
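To make the "table of pfn's and mappings thereof" concrete, here is one way the inference could look in userspace. It is a sketch under stated assumptions, not anyone's posted code: each process contributes a deduplicated list of pfns (as might be gathered from its pagemap), the number of processes in the set that list a pfn stands in for page_mapcount(), and mappers outside the set are invisible to it. The struct and function names are hypothetical, and the two results are one plausible reading of the two RSS terms above.

```c
#include <stddef.h>
#include <stdint.h>

struct proc_pfns {
	const uint64_t *pfn;	/* pfns mapped by one process, no duplicates */
	size_t n;
};

/* How many of the given processes list this pfn. */
static unsigned mappers(const struct proc_pfns *procs, size_t nprocs,
			uint64_t pfn)
{
	unsigned count = 0;
	size_t p, i;

	for (p = 0; p < nprocs; p++)
		for (i = 0; i < procs[p].n; i++)
			if (procs[p].pfn[i] == pfn) {
				count++;
				break;
			}
	return count;
}

/*
 * For process `who`, report the pages it alone maps (unique) and the
 * sum over its pages of 1/mappers (proportional). Both are in pages.
 */
static void rss_split(const struct proc_pfns *procs, size_t nprocs,
		      size_t who, size_t *unique, double *proportional)
{
	size_t i;

	*unique = 0;
	*proportional = 0.0;
	for (i = 0; i < procs[who].n; i++) {
		/* At least 1, since `who` itself maps the page. */
		unsigned m = mappers(procs, nprocs, procs[who].pfn[i]);

		if (m == 1)
			(*unique)++;
		*proportional += 1.0 / m;
	}
}
```

The linear scan keeps the sketch short; a real tool handling millions of pages would sort the pfns or use a hash table. Exporting page_mapcount() directly would make the whole table unnecessary.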
Re: [PATCH 0/13] maps: pagemap, kpagemap, and related cleanups
On Thu, 12 Apr 2007 16:10:50 -0700 William Lee Irwin III <[EMAIL PROTECTED]> wrote:

> On Tue, Apr 03, 2007 at 09:43:30PM -0500, Matt Mackall wrote:
> > This patch series introduces /proc/pid/pagemap and /proc/kpagemap,
> > which allow detailed run-time examination of process memory usage at a
> > page granularity.
> > The first several patches whip the page-walking code introduced for
> > /proc/pid/smaps and clear_refs into a more generic form, the next
> > couple make those interfaces optional, and the last two introduce the
> > new interfaces, also optional.
>
> This solves a real-life problem for Oracle system monitoring software
> (specifically EM). Among the tasks it must carry out is determining
> per-process memory footprint of a set of cooperating tasks (i.e. Oracle
> processes). RSS is inadequate for this due to page sharing; this work
> provides sufficient information to determine what EM needs.

I'm still dying to see what the human-readable output from this thing looks
like.

> + * Each entry is a pair of unsigned longs representing the
> + * corresponding physical page, the first containing the page flags
> + * and the second containing the page use count.
> + *
> + * The first 4 bytes of this file form a simple header:
> + *
> + * first byte:   0 for big endian, 1 for little
> + * second byte:  page shift (eg 12 for 4096 byte pages)
> + * third byte:   entry size in bytes (currently either 4 or 8)
> + * fourth byte:  header size
>
> ...
>
> +	while (count > 0) {
> +		chunk = min_t(size_t, count, PAGE_SIZE);
> +		i = 0;
> +
> +		if (pfn == -1) {
> +			page[0] = 0;
> +			page[1] = 0;
> +			((char *)page)[0] = (ntohl(1) != 1);

OK.

> +			((char *)page)[1] = PAGE_SHIFT;

OK.

> +			((char *)page)[2] = sizeof(unsigned long);

OK.

> +			((char *)page)[3] = KPMSIZE;

OK.

> +			i = 2;
> +			pfn++;
> +		}
> +
> +		for (; i < 2 * chunk / KPMSIZE; i += 2, pfn++) {
> +			ppage = pfn_to_page(pfn);
> +			if (!ppage) {
> +				page[i] = 0;
> +				page[i + 1] = 0;
> +			} else {
> +				page[i] = ppage->flags;
> +				page[i + 1] = atomic_read(&ppage->_count);
> +			}
> +		}

Not a good idea to expose raw flags in this manner - it changes at the drop
of a hat. We'd need to also expose the kernel's PG_foo-to-bitnumber
mapping to make this viable.

Not a good idea to use page->_count: page_count() will be more stable.
Otherwise OK, I guess: the interpretation of the page refcount is unlikely
to change much over time.
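For readers following along, a minimal userspace sketch of how a tool might decode the 4-byte header described in the patch comment and locate the entry for a given pfn. The names kpm_header, kpm_parse_header() and kpm_offset() are invented for illustration. The offset formula assumes that the pair for pfn 0 starts immediately after "header size" bytes and that each entry is two values of "entry size" bytes; the quoted loop suggests this, but the thread does not state it outright.

```c
#include <stddef.h>
#include <stdint.h>

/* Decoded form of the 4-byte header from the patch comment. */
struct kpm_header {
	int little_endian;	/* first byte: 0 = big endian, 1 = little */
	unsigned page_shift;	/* second byte: e.g. 12 for 4096-byte pages */
	unsigned entry_size;	/* third byte: 4 or 8 */
	unsigned header_size;	/* fourth byte: assumed bytes before pfn 0 */
};

/* Returns 0 on success, -1 if the header looks malformed. */
static int kpm_parse_header(const unsigned char *buf, size_t len,
			    struct kpm_header *h)
{
	if (len < 4 || buf[0] > 1)
		return -1;
	if (buf[2] != 4 && buf[2] != 8)
		return -1;
	h->little_endian = buf[0];
	h->page_shift = buf[1];
	h->entry_size = buf[2];
	h->header_size = buf[3];
	return 0;
}

/* Assumed byte offset of the (flags, count) pair for a given pfn. */
static uint64_t kpm_offset(const struct kpm_header *h, uint64_t pfn)
{
	return h->header_size + pfn * 2u * h->entry_size;
}
```

Since all four fields are single bytes, the header can be read before the byte order is known, which is what lets a dump taken on one machine be interpreted on another.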
Re: [PATCH 0/13] maps: pagemap, kpagemap, and related cleanups
On Tue, Apr 03, 2007 at 09:43:30PM -0500, Matt Mackall wrote:
> This patch series introduces /proc/pid/pagemap and /proc/kpagemap,
> which allow detailed run-time examination of process memory usage at a
> page granularity.
> The first several patches whip the page-walking code introduced for
> /proc/pid/smaps and clear_refs into a more generic form, the next
> couple make those interfaces optional, and the last two introduce the
> new interfaces, also optional.

This solves a real-life problem for Oracle system monitoring software
(specifically EM). Among the tasks it must carry out is determining
per-process memory footprint of a set of cooperating tasks (i.e. Oracle
processes). RSS is inadequate for this due to page sharing; this work
provides sufficient information to determine what EM needs.

-- wli
Re: [PATCH 0/13] maps: pagemap, kpagemap, and related cleanups
Matt Mackall wrote:
> On Fri, Apr 13, 2007 at 10:15:24AM +1000, Nick Piggin wrote:
> > Andrew Morton wrote:
> > >On Thu, 12 Apr 2007 16:10:50 -0700
> > >William Lee Irwin III <[EMAIL PROTECTED]> wrote:
> >
> > >>+	while (count > 0) {
> > >>+		chunk = min_t(size_t, count, PAGE_SIZE);
> > >>+		i = 0;
> > >>+
> > >>+		if (pfn == -1) {
> > >>+			page[0] = 0;
> > >>+			page[1] = 0;
> > >>+			((char *)page)[0] = (ntohl(1) != 1);
> > >
> > >OK.
> > >
> > >>+			((char *)page)[1] = PAGE_SHIFT;
> > >
> > >OK.
> >
> > Shouldn't we just expose page size and endianness by other means? (another
> > file or syscall).
>
> If I send you this file dumped from a random machine, you won't know
> what to make of it.

That's a good reason ;)

> I'm planning to write a trivial server to sit on,
> say, my embedded target and spew this over the wire to a client.
>
> > >Not a good idea to expose raw flags in this manner - it changes at the drop
> > >of a hat. We'd need to also expose the kernel's PG_foo-to-bitnumber
> > >mapping to make this viable.
> >
> > I don't think it is viable because that makes the flags part of the
> > userspace ABI. I wonder what they are needed for.
>
> Basically: to show what the hell's going on in the VM.

kprobes / systemtap isn't good enough?

-- 
SUSE Labs, Novell Inc.
Re: [PATCH 0/13] maps: pagemap, kpagemap, and related cleanups
Andrew Morton wrote:
On Fri, 13 Apr 2007 10:15:24 +1000 Nick Piggin <[EMAIL PROTECTED]> wrote:

+	for (; i < 2 * chunk / KPMSIZE; i += 2, pfn++) {
+		ppage = pfn_to_page(pfn);
+		if (!ppage) {
+			page[i] = 0;
+			page[i + 1] = 0;
+		} else {
+			page[i] = ppage->flags;
+			page[i + 1] = atomic_read(&ppage->_count);
+		}
+	}

Not a good idea to expose raw flags in this manner - it changes at the drop of a hat.

We'd need to also expose the kernel's PG_foo-to-bitnumber mapping to make this viable.

I don't think it is viable because that makes the flags part of the userspace ABI.

It *will* be viable. If the application wants to know if a page is dirty, it looks up PG_dirty in /proc/pg_foo-to-bitnumber and uses PG_dirty's numerical offset when inspecting fields in /proc/kpagemap. If correctly designed, such a monitoring application will be able to report upon page flags which we haven't even thought up yet.

Ooh, you wanted a _runtime_ mapping of flags, yeah then I guess that works. Still seems like a basically hit and miss affair to just use flags. What if you want to know the process mapping a page? With systemtap or something you could walk the rmap structures. What if you want to look at pages along the LRU list rather than per-pfn? What about connecting pages to inodes?

I thought this type of deep poking was the whole reason the probes thingies were merged. I'm saddened that they're no good for this. I thought it would be an ideal usage :(

I wonder what they are needed for.

Poking deeply into the kernel to provide information about kernel state. There are real-world needs for this, and the people who develop tools to process this information will have decent kernel understanding and will know that the file's contents may alter across kernel versions. It sure beats poking around in /dev/kmem.

I doubt if there's a sensible way in which we can prettify this interface without losing information. But we should aim to make it as robust as possible against future kernel changes, of course. And we should satisfy ourselves that all the required information has been made available. The fact that it will satisfy the Oracle requirement is encouraging.

Yeah it is close, they need page_mapcount I think. But I was going to say that satisfying an Oracle requirement is a good reason _not_ to merge it ;)

(I joke!)

--
SUSE Labs, Novell Inc.
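The runtime flag lookup debated in the message above might be sketched in C roughly as follows. This is only an illustration under stated assumptions: the thread uses /proc/pg_foo-to-bitnumber as a placeholder name and proposes no file format, so the one "NAME BITNUMBER" pair per line layout and the helper names flag_bit_lookup and flag_is_set are hypothetical.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Find a flag's bit number by name in a name-to-bit table.
 * ASSUMPTION: the table is text with one "NAME BITNUMBER" pair per line.
 * The thread does not define a format; this layout is hypothetical.
 * Returns the bit number, or -1 if the name is not listed.
 */
int flag_bit_lookup(const char *table, const char *name)
{
    const char *line = table;
    char key[64];
    int bit;

    assert(table != NULL && name != NULL);

    while (*line != '\0') {
        /* Parse the pair at the start of the current line. */
        if (sscanf(line, "%63s %d", key, &bit) == 2 && strcmp(key, name) == 0)
            return bit;
        line = strchr(line, '\n');
        if (line == NULL)
            break;
        line++; /* step past the newline to the next line */
    }
    return -1;
}

/*
 * Test one flag in a raw flags word, using a bit number that was
 * discovered at run time instead of being compiled into the tool.
 * An unknown flag (bit < 0) or an out-of-range bit reads as "not set".
 */
int flag_is_set(unsigned long flags, int bit)
{
    if (bit < 0 || bit >= (int)(sizeof(flags) * 8))
        return 0;
    return (int)((flags >> bit) & 1UL);
}
```

With a table containing the line "PG_dirty 4" (a made-up bit number), flag_bit_lookup(table, "PG_dirty") yields 4 and flag_is_set(0x10UL, 4) yields 1. Because the tool never hard-codes a bit position, it could in principle also report flags added by later kernels, which is the property argued for above.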
Re: [PATCH 0/13] maps: pagemap, kpagemap, and related cleanups
On Fri, 13 Apr 2007 11:14:20 +1000 Nick Piggin <[EMAIL PROTECTED]> wrote:
Andrew Morton wrote:
On Fri, 13 Apr 2007 10:15:24 +1000 Nick Piggin <[EMAIL PROTECTED]> wrote:

+	for (; i < 2 * chunk / KPMSIZE; i += 2, pfn++) {
+		ppage = pfn_to_page(pfn);
+		if (!ppage) {
+			page[i] = 0;
+			page[i + 1] = 0;
+		} else {
+			page[i] = ppage->flags;
+			page[i + 1] = atomic_read(&ppage->_count);
+		}
+	}

Not a good idea to expose raw flags in this manner - it changes at the drop of a hat.

We'd need to also expose the kernel's PG_foo-to-bitnumber mapping to make this viable.

I don't think it is viable because that makes the flags part of the userspace ABI.

It *will* be viable. If the application wants to know if a page is dirty, it looks up PG_dirty in /proc/pg_foo-to-bitnumber and uses PG_dirty's numerical offset when inspecting fields in /proc/kpagemap. If correctly designed, such a monitoring application will be able to report upon page flags which we haven't even thought up yet.

Ooh, you wanted a _runtime_ mapping of flags, yeah then I guess that works. Still seems like a basically hit and miss affair to just use flags. What if you want to know the process mapping a page? With systemtap or something you could walk the rmap structures. What if you want to look at pages along the LRU list rather than per-pfn? What about connecting pages to inodes?

Well hang on. This isn't a tool for understanding kernel behaviour. It's a tool for understanding application behaviour. So one doesn't ask "who is mapping that page" - that's a kernel developer thing. Instead, one says "what pages are being used by my application", then, for each of those pages "what is that page's state".

So the first step is to collect all the pfns from /proc/$(pidof my-application)/pagemap and then to use those pfns to look the individual pages up in /proc/kpagemap.

If you really want to know "who is using page 123435" then you'd need to search /proc/*/pagemap. There are possibly legitimate reasons why an application developer would want to at least partially perform such an operation ("who am I sharing with"), but I doubt if it's the common case.

But I was going to say that satisfying an Oracle requirement is a good reason _not_ to merge it ;)

hm, yes, there's plenty of precedent for that.

(I joke!)

I akpm!
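The second half of the two-step lookup described above (take a pfn, find that page's entry in /proc/kpagemap) could look something like the C sketch below. It assumes the layout implied by the quoted loop, two unsigned long words per pfn (flags, then count), starting at offset 0 with no header. The thread does not confirm that layout, and the names kpm_record, kpm_offset and kpm_decode are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* One decoded record: the two words the quoted loop stores per pfn. */
struct kpm_record {
    unsigned long flags; /* raw page flags word */
    unsigned long count; /* page reference count */
};

/* Size in bytes of one record under the assumed layout. */
#define KPM_RECORD_BYTES (2 * sizeof(unsigned long))

/*
 * Byte offset at which a pfn's record would start.
 * ASSUMPTION: records begin at offset 0, with no file header.
 */
unsigned long long kpm_offset(unsigned long pfn)
{
    return (unsigned long long)pfn * KPM_RECORD_BYTES;
}

/*
 * Decode the record for 'pfn' from a buffer that holds consecutive
 * records starting at 'first_pfn'. Returns 0 on success, or -1 if the
 * pfn lies outside the buffer.
 */
int kpm_decode(const unsigned char *buf, size_t len,
               unsigned long first_pfn, unsigned long pfn,
               struct kpm_record *out)
{
    unsigned long words[2];
    unsigned long index;

    assert(buf != NULL && out != NULL);

    if (pfn < first_pfn)
        return -1;
    index = pfn - first_pfn;
    if (index >= len / KPM_RECORD_BYTES)
        return -1;

    /* memcpy avoids alignment assumptions about the caller's buffer. */
    memcpy(words, buf + (size_t)index * KPM_RECORD_BYTES, sizeof(words));
    out->flags = words[0];
    out->count = words[1];
    return 0;
}
```

A tool would first gather pfns from the per-process pagemap file, then seek to kpm_offset(pfn) in the system-wide file, read KPM_RECORD_BYTES bytes, and pass them to kpm_decode. Answering "who is using this page" would need the reverse search over every process's pagemap, as noted above.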
Re: [PATCH 0/13] maps: pagemap, kpagemap, and related cleanups
Andrew Morton wrote:
On Fri, 13 Apr 2007 11:14:20 +1000 Nick Piggin <[EMAIL PROTECTED]> wrote:
Andrew Morton wrote:

It *will* be viable. If the application wants to know if a page is dirty, it looks up PG_dirty in /proc/pg_foo-to-bitnumber and uses PG_dirty's numerical offset when inspecting fields in /proc/kpagemap. If correctly designed, such a monitoring application will be able to report upon page flags which we haven't even thought up yet.

Ooh, you wanted a _runtime_ mapping of flags, yeah then I guess that works. Still seems like a basically hit and miss affair to just use flags. What if you want to know the process mapping a page? With systemtap or something you could walk the rmap structures. What if you want to look at pages along the LRU list rather than per-pfn? What about connecting pages to inodes?

Well hang on. This isn't a tool for understanding kernel behaviour. It's a tool for understanding application behaviour. So one doesn't ask "who is mapping that page" - that's a kernel developer thing. Instead, one says "what pages are being used by my application", then, for

That includes unmapped pagecache being used by my application, doesn't it? Maybe that's too hard to do via /proc so we forget about it...

each of those pages "what is that page's state". So the first step is to collect all the pfns from /proc/$(pidof my-application)/pagemap and then to use those pfns to look the individual pages up in /proc/kpagemap.

OK I realise you could do it that way, but systemtap can definitely be used as a tool for understanding application behaviour in the context of the kernel, I think? The purpose for it is so that various little bits of deep kernel internals do not have to be exposed on a case by case basis.

If kprobes is simply crappy and doesn't work properly for this, then I could accept that. I'm not someone trying to get this info. So why can't it be used? (not just for kpagemap, but for clear_refs and all that gunk too).

If you really want to know "who is using page 123435" then you'd need to search /proc/*/pagemap. There are possibly legitimate reasons why an application developer would want to at least partially perform such an operation ("who am I sharing with"), but I doubt if it's the common case.

Maybe. How about LRU? Reclaim performance is bad, and you want to work out which pages keep going off the end of it, or which pages keep getting written out via it, or whose pages are on the active list, forcing mine out.

--
SUSE Labs, Novell Inc.
Re: [PATCH 0/13] maps: pagemap, kpagemap, and related cleanups
On Fri, Apr 13, 2007 at 11:01:41AM +1000, Nick Piggin wrote:

Basically: to show what the hell's going on in the VM.

kprobes / systemtap isn't good enough?

It's not really a good match to the kprobes model. I'm not interested in events, per se. I don't want to need to know about every single alloc/free of N different varieties integrated from boot onward to build up an image of the state of the system. Instead, I want to take snapshots of the state of the VM.

The main goal here is to be able to answer the question "where's my memory going?". Currently you can't really give a good answer to that question from userspace because of shared mappings, etc. There are lots of secondary questions that follow on very quickly from that, like "what parts of my shared mappings are or aren't shared, and why?", "what's actually in my application's working set?" and "how much of this crap can I ditch?".

--
Mathematics is the supreme nostalgia of our time.
Re: [PATCH 0/13] maps: pagemap, kpagemap, and related cleanups
On Fri, 13 Apr 2007 11:42:29 +1000 Nick Piggin <[EMAIL PROTECTED]> wrote:
Andrew Morton wrote:
On Fri, 13 Apr 2007 11:14:20 +1000 Nick Piggin <[EMAIL PROTECTED]> wrote:
Andrew Morton wrote:

It *will* be viable. If the application wants to know if a page is dirty, it looks up PG_dirty in /proc/pg_foo-to-bitnumber and uses PG_dirty's numerical offset when inspecting fields in /proc/kpagemap. If correctly designed, such a monitoring application will be able to report upon page flags which we haven't even thought up yet.

Ooh, you wanted a _runtime_ mapping of flags, yeah then I guess that works. Still seems like a basically hit and miss affair to just use flags. What if you want to know the process mapping a page? With systemtap or something you could walk the rmap structures. What if you want to look at pages along the LRU list rather than per-pfn? What about connecting pages to inodes?

Well hang on. This isn't a tool for understanding kernel behaviour. It's a tool for understanding application behaviour. So one doesn't ask "who is mapping that page" - that's a kernel developer thing. Instead, one says "what pages are being used by my application", then, for

That includes unmapped pagecache being used by my application, doesn't it? Maybe that's too hard to do via /proc so we forget about it...

Yes, harder. I'm hoping that sampling of /proc/pid/io can be used to determine pagecache use sufficiently accurately. I know of one large hosting company who are using it (BTW, we are making great use of taskstats!! Its GREAT!)

each of those pages "what is that page's state". So the first step is to collect all the pfns from /proc/$(pidof my-application)/pagemap and then to use those pfns to look the individual pages up in /proc/kpagemap.

OK I realise you could do it that way, but systemtap can definitely be used as a tool for understanding application behaviour in the context of the kernel, I think? The purpose for it is so that various little bits of deep kernel internals do not have to be exposed on a case by case basis.

If kprobes is simply crappy and doesn't work properly for this, then I could accept that. I'm not someone trying to get this info. So why can't it be used? (not just for kpagemap, but for clear_refs and all that gunk too).

If you really want to know "who is using page 123435" then you'd need to search /proc/*/pagemap. There are possibly legitimate reasons why an application developer would want to at least partially perform such an operation ("who am I sharing with"), but I doubt if it's the common case.

Maybe. How about LRU? Reclaim performance is bad, and you want to work out which pages keep going off the end of it, or which pages keep getting written out via it, or whose pages are on the active list, forcing mine out.

I guess we have static analysis versus dynamic. The interfaces which Matt is proposing are suited to answering the question "what is my memory being used for" (static). They're unlikely to be useful for answering the question "what's happening in the VM" (dynamic). Systemtap is probably better for the dynamic analysis.

I guess one could generate an answer to the static question with systemtap, by accumulating running counts across the application lifetime and then snapshotting them. Sounds hard though.
Re: [PATCH 0/13] maps: pagemap, kpagemap, and related cleanups
On Fri, Apr 13, 2007 at 11:42:29AM +1000, Nick Piggin wrote:

Instead, one says "what pages are being used by my application", then, for

That includes unmapped pagecache being used by my application, doesn't it? Maybe that's too hard to do via /proc so we forget about it...

It'd be really nice to have a window into the pagecache too. But I for one couldn't come up with a sensible scheme for it.

each of those pages "what is that page's state". So the first step is to collect all the pfns from /proc/$(pidof my-application)/pagemap and then to use those pfns to look the individual pages up in /proc/kpagemap.

OK I realise you could do it that way, but systemtap can definitely be used as a tool for understanding application behaviour in the context of the kernel, I think? The purpose for it is so that various little bits of deep kernel internals do not have to be exposed on a case by case basis.

If kprobes is simply crappy and doesn't work properly for this, then I could accept that. I'm not someone trying to get this info. So why can't it be used? (not just for kpagemap, but for clear_refs and all that gunk too).

kprobes is good for looking at events, but bad for looking at state. Especially metric shitloads of state.

If you really want to know "who is using page 123435" then you'd need to search /proc/*/pagemap. There are possibly legitimate reasons why an application developer would want to at least partially perform such an operation ("who am I sharing with"), but I doubt if it's the common case.

Maybe. How about LRU? Reclaim performance is bad, and you want to work out which pages keep going off the end of it, or which pages keep getting written out via it, or whose pages are on the active list, forcing mine out.

Those are actually probably a good match for systemtap as they're all events.

--
Mathematics is the supreme nostalgia of our time.
Re: [PATCH 0/13] maps: pagemap, kpagemap, and related cleanups
Matt Mackall wrote:
On Fri, Apr 13, 2007 at 11:01:41AM +1000, Nick Piggin wrote:

Basically: to show what the hell's going on in the VM.

kprobes / systemtap isn't good enough?

It's not really a good match to the kprobes model. I'm not interested in events, per se. I don't want to need to know about every single alloc/free of N different varieties integrated from boot onward to build up an image of the state of the system. Instead, I want to take snapshots of the state of the VM.

Systemtap can't output a large set of values? Why can't you attach a kprobe to a dummy syscall, and from there iterate over pgdat/zone/memmap and output what you want? Actually I'm surprised that kind of data querying facility isn't already in there (I haven't used it seriously though).

The main goal here is to be able to answer the question "where's my memory going?". Currently you can't really give a good answer to that question from userspace because of shared mappings, etc. There are lots of secondary questions that follow on very quickly from that, like "what parts of my shared mappings are or aren't shared, and why?", "what's actually in my application's working set?" and "how much of this crap can I ditch?".

I understand roughly what you want, and that you can't easily get it from /proc currently. My question at this point is just why can we not use systemtap.

--
SUSE Labs, Novell Inc.
Re: [PATCH 0/13] maps: pagemap, kpagemap, and related cleanups
On Thu, Apr 12, 2007 at 06:57:23PM -0700, Andrew Morton wrote:

I guess one could generate an answer to the static question with systemtap, by accumulating running counts across the application lifetime and then snapshotting them. Sounds hard though.

You'd have to do it from boot onward to get a complete system image. One way to look at it is that systemtap can give you the derivative of the information, and you have to integrate it.

--
Mathematics is the supreme nostalgia of our time.
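The "derivative versus integral" point above can be made concrete with a small C sketch. It is not systemtap code and not anything proposed in the thread; the event type, the struct and the function name integrate_events are hypothetical. It only shows why an event stream yields the current state when the starting state is known.

```c
#include <assert.h>
#include <stddef.h>

/* The two kinds of event a tracer might report (hypothetical). */
enum vm_event_type { EV_ALLOC, EV_FREE };

/* One traced event: some number of pages allocated or freed. */
struct vm_event {
    enum vm_event_type type;
    unsigned long pages;
};

/*
 * Integrate a stream of alloc/free events on top of a baseline.
 * The result is the page count after the last event, but only if
 * 'baseline' really was the count when tracing began. Starting from
 * boot makes the baseline zero; starting later leaves it unknown, so
 * the result is then only the change since tracing started.
 */
long integrate_events(long baseline, const struct vm_event *ev, size_t n)
{
    long total = baseline;
    size_t i;

    assert(ev != NULL || n == 0);

    for (i = 0; i < n; i++) {
        if (ev[i].type == EV_ALLOC)
            total += (long)ev[i].pages;
        else
            total -= (long)ev[i].pages;
    }
    return total;
}
```

A snapshot interface returns the equivalent of this total directly, without needing the full event history. That is the trade-off being argued in this part of the thread.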
Re: [PATCH 0/13] maps: pagemap, kpagemap, and related cleanups
Andrew Morton wrote:
On Fri, 13 Apr 2007 11:42:29 +1000 Nick Piggin <[EMAIL PROTECTED]> wrote:

Maybe. How about LRU? Reclaim performance is bad, and you want to work out which pages keep going off the end of it, or which pages keep getting written out via it, or whose pages are on the active list, forcing mine out.

I guess we have static analysis versus dynamic. The interfaces which Matt is proposing are suited to answering the question "what is my memory being used for" (static). They're unlikely to be useful for answering the question "what's happening in the VM" (dynamic). Systemtap is probably better for the dynamic analysis.

"what is my memory being used for *now*" ;)

I guess one could generate an answer to the static question with systemtap, by accumulating running counts across the application lifetime and then snapshotting them. Sounds hard though.

Can't you just traverse arbitrary kernel data structures at a given point in time, exactly like the /proc/ call is doing?

--
SUSE Labs, Novell Inc.
Re: [PATCH 0/13] maps: pagemap, kpagemap, and related cleanups
Matt Mackall wrote:
On Fri, Apr 13, 2007 at 11:42:29AM +1000, Nick Piggin wrote:

If kprobes is simply crappy and doesn't work properly for this, then I could accept that. I'm not someone trying to get this info. So why can't it be used? (not just for kpagemap, but for clear_refs and all that gunk too).

kprobes is good for looking at events, but bad for looking at state. Especially metric shitloads of state.

Why? Why is a kprobes trap significantly more expensive than a read syscall?

Maybe. How about LRU? Reclaim performance is bad, and you want to work out which pages keep going off the end of it, or which pages keep getting written out via it, or whose pages are on the active list, forcing mine out.

Those are actually probably a good match for systemtap as they're all events.

Traverse the LRU? Which files do they belong to? What process maps them?

--
SUSE Labs, Novell Inc.
Re: [PATCH 0/13] maps: pagemap, kpagemap, and related cleanups
Matt Mackall wrote:
On Thu, Apr 12, 2007 at 06:57:23PM -0700, Andrew Morton wrote:

I guess one could generate an answer to the static question with systemtap, by accumulating running counts across the application lifetime and then snapshotting them. Sounds hard though.

You'd have to do it from boot onward to get a complete system image. One way to look at it is that systemtap can give you the derivative of the information, and you have to integrate it.

So everyone keeps saying. Would you tell me why you can't just traverse the data structures in the same way as your proc handler? From the systemtap example scripts it seems like you can traverse arbitrary kernel data structures.

--
SUSE Labs, Novell Inc.
Re: [PATCH 0/13] maps: pagemap, kpagemap, and related cleanups
On Fri, 13 Apr 2007 12:18:56 +1000 Nick Piggin <[EMAIL PROTECTED]> wrote:

I guess one could generate an answer to the static question with systemtap, by accumulating running counts across the application lifetime and then snapshotting them. Sounds hard though.

Can't you just traverse arbitrary kernel data structures at a given point in time, exactly like the /proc/ call is doing?

Do a full pagetable walk, with all the associated locking from within a systemtap script? I'd be surprised. Maybe if it's mostly hand-coded in C, perhaps. Then you just end up with the same thing, don't you?
Re: [PATCH 0/13] maps: pagemap, kpagemap, and related cleanups
On Fri, Apr 13, 2007 at 12:21:25PM +1000, Nick Piggin wrote:
Matt Mackall wrote:
On Fri, Apr 13, 2007 at 11:42:29AM +1000, Nick Piggin wrote:

If kprobes is simply crappy and doesn't work properly for this, then I could accept that. I'm not someone trying to get this info. So why can't it be used? (not just for kpagemap, but for clear_refs and all that gunk too).

kprobes is good for looking at events, but bad for looking at state. Especially metric shitloads of state.

Why? Why is a kprobes trap significantly more expensive than a read syscall?

I guess I'm not clear on what you're proposing. From my understanding of kprobes (admittedly not an expert), this is hard to do and not a very good match.

Maybe. How about LRU? Reclaim performance is bad, and you want to work out which pages keep going off the end of it, or which pages keep getting written out via it, or whose pages are on the active list, forcing mine out.

Those are actually probably a good match for systemtap as they're all events.

Traverse the LRU? Which files do they belong to? What process maps them?

-ENOPARSE.

--
Mathematics is the supreme nostalgia of our time.