Re: [RFC] [PATCH 1/7] User Space Breakpoint Assistance Layer (UBP)
On 01/27/2010 12:23 PM, Ingo Molnar wrote: * Avi Kivity a...@redhat.com wrote: (back from vacation) If so then you ignore the obvious solution to _that_ problem: dont use INT3 at all, but rebuild (or re-JIT) your program with explicit callbacks. It's _MUCH_ faster than _any_ breakpoint based solution - literally just the cost of a function call (or not even that - i've written very fast inlined tracers - they do rock when it comes to performance). Problem solved and none of the INT3 details matters at all. However did I not think of that? Yes, and let's rip off kprobes tracing from the kernel, we can always rebuild it. Well, I'm observing an issue in a production system now. I may not want to take it down, or if I take it down I may not be able to observe it again as the problem takes a couple of days to show up, or I may not have the full source, or it takes 10 minutes to build and so an iterative edit/build/run cycle can stretch for hours. You have somewhat misconstrued my argument. What i said above is that _if_ you need extreme levels of performance you always have the option to go even faster via specialized tracing solutions. I did not promote it as a replacement solution. Specialization obviously brings in a new set of problems: inflexibility and non-transparency, an example of which you gave above. Your proposed solution brings in precisely such kinds of issues, on a different level, just to improve performance at the cost of transparency and at the cost of features and robustness. We just disagree on the intrusiveness, then. IMO it will be a very rare application that really suffers from a vma injection, since most apps don't manage their vmas directly but leave it to the kernel and ld.so. It's btw rather ironic as your arguments are somewhat similar to the Xen vs. KVM argument just turned around: KVM started out slower by relying on hardware implementation for virtualization while Xen relied on a clever but limiting hack.
With each CPU generation the hardware got faster, while the various design limitations of Xen are hurting it and KVM is winning that race. A (partially) similar situation exists here: INT3 into ring 0 and handling it there in a protected environment might be more expensive, but _if_ it matters to performance it sure could be made faster in hardware (and in fact it will become faster with every new generation of hardware). Not at all. For kvm hardware eliminates exits completely where pv Xen tries to reduce their cost, but an INT3 will be forever much more expensive than a jump. You are right however that we should favour hardware support where available, and for high bandwidth tracing, it is available: branch trace store. With that, it is easy to know how many times the processor passed through some code point as well as to reconstruct the entire call chain, basically what the function tracer does for the kernel. Do we have facilities for exposing that to userspace? It can also be very useful for the kernel. It will still be slower if we only trace a few points, and it can't trace register and memory values, but it's a good tool to have IMO. Both Peter and me are telling you that we are considering your solution too specialized, at the cost of flexibility, features and robustness. We'll agree to disagree on that then. -- error compiling committee.c: too many arguments to function
Re: [RFC] [PATCH 1/7] User Space Breakpoint Assistance Layer (UBP)
On 01/27/2010 10:24 AM, Ingo Molnar wrote: Not to mention that that process could wreck the trace data rendering it utterly unreliable. It could, but it also might not. Are we going to deny high performance tracing to users just because it doesn't work in all cases? Tracing and monitoring is foremost about being able to trust the instrument, then about performance and usability. That's one of the big things about ftrace and perf. By proposing 'user space tracing' you are missing two big aspects: - That self-contained, kernel-driven tracing can be replicated in user-space. It cannot. Sharing and global state is much harder to maintain reliably, but the bigger problem is that user-space can stomp on its own tracing state and can make it unreliable. Tracing is often used to figure out bugs, and tracers will be trusted less if they can stomp on themselves. - That somehow it's much faster and that this edge matters. It isnt and it doesnt matter. The few places that need very very fast tracing wont use any of these facilities - it will use something specialized. So you are creating a solution for special cases that dont need it, and you are also ignoring prime qualities of a good tracing framework. I see it exactly the opposite. Only a very small minority of cases will have such severe memory corruption that tracing will fall apart because of random writes to memory; especially on 64-bit where the address space is sparse. On the other hand, knowing that the cost is a few dozen cycles rather than a thousand or so means that you can trace production servers running full loads without worrying about whether tracing will affect whatever it is you're trying to observe. I'm not against slow reliable tracing, but we shouldn't ignore the need for speed. -- Do not meddle in the internals of kernels, for they are subtle and quick to panic.
Re: [RFC] [PATCH 1/7] User Space Breakpoint Assistance Layer (UBP)
* Avi Kivity a...@redhat.com wrote: On 01/27/2010 10:24 AM, Ingo Molnar wrote: Not to mention that that process could wreck the trace data rendering it utterly unreliable. It could, but it also might not. Are we going to deny high performance tracing to users just because it doesn't work in all cases? Tracing and monitoring is foremost about being able to trust the instrument, then about performance and usability. That's one of the big things about ftrace and perf. By proposing 'user space tracing' you are missing two big aspects: - That self-contained, kernel-driven tracing can be replicated in user-space. It cannot. Sharing and global state is much harder to maintain reliably, but the bigger problem is that user-space can stomp on its own tracing state and can make it unreliable. Tracing is often used to figure out bugs, and tracers will be trusted less if they can stomp on themselves. - That somehow it's much faster and that this edge matters. It isnt and it doesnt matter. The few places that need very very fast tracing wont use any of these facilities - it will use something specialized. So you are creating a solution for special cases that dont need it, and you are also ignoring prime qualities of a good tracing framework. I see it exactly the opposite. Only a very small minority of cases will have such severe memory corruption that tracing will fall apart because of random writes to memory; especially on 64-bit where the address space is sparse. On the other hand, knowing that the cost is a few dozen cycles rather than a thousand or so means that you can trace production servers running full loads without worrying about whether tracing will affect whatever it is you're trying to observe. I'm not against slow reliable tracing, but we shouldn't ignore the need for speed. 
I havent seen a concise summary of your points in this thread, so let me summarize them as i've understood them (hopefully not putting words into your mouth):

AFAICS you are arguing for some crazy fragile architecture-specific solution that traps INT3 into ring3 just to shave off a few cycles, and then uses user-space state to trace into.

If so then you ignore the obvious solution to _that_ problem: dont use INT3 at all, but rebuild (or re-JIT) your program with explicit callbacks. It's _MUCH_ faster than _any_ breakpoint based solution - literally just the cost of a function call (or not even that - i've written very fast inlined tracers - they do rock when it comes to performance). Problem solved and none of the INT3 details matters at all.

INT3 only matters to _transparent_ probing, and for that, the cost of INT3 is almost _by definition_ less important than the fact that we can do transparent tracing. If performance were the overriding issue they'd use dedicated callbacks - and the INT3 technique wouldnt matter at all.

( Also, just like we were able to extend the kprobes code with more and more optimizations, the same can be done with any user-space probing as well, to make it faster. But at the core of it has to be a sane design that is transparent and controlled by the kernel, so that it has the option to apply more and more optimizations - yours isnt such and its limitations are designed-in. Which is neither smart nor useful. )

Ingo
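To make the "explicit callbacks" option concrete, here is a minimal sketch of a compiled-in, inlined trace point. It is an illustration only: the names `trace_point`, `trace_hits`, `trace_enabled` and `add_traced` are hypothetical and do not come from the thread or from any kernel interface. The assumption is simply that the program can be rebuilt with the hook in place, which is exactly the condition under which this approach applies.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* All names here are hypothetical; nothing below is an API from this thread. */
#define TRACE_MAX_SITES 16

uint64_t trace_hits[TRACE_MAX_SITES]; /* one hit counter per trace site */
int trace_enabled;                    /* global on/off switch */

/* Inlined trace point: one branch, plus one increment when enabled. */
static inline void trace_point(size_t site)
{
    assert(site < TRACE_MAX_SITES);   /* site ids are fixed at build time */
    if (trace_enabled)
        trace_hits[site]++;
}

/* An instrumented function: the callback is compiled in, no breakpoint. */
int add_traced(int a, int b)
{
    trace_point(0);
    return a + b;
}
```

With `trace_enabled` left at 0 the instrumented function only pays for a branch; setting it to 1 starts counting hits in `trace_hits[0]`. The trade-off is the one argued over above: this is fast, but it is not transparent, because the program has to be rebuilt to contain the hook.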
Re: [RFC] [PATCH 1/7] User Space Breakpoint Assistance Layer (UBP)
* Avi Kivity a...@redhat.com wrote: If so then you ignore the obvious solution to _that_ problem: dont use INT3 at all, but rebuild (or re-JIT) your program with explicit callbacks. It's _MUCH_ faster than _any_ breakpoint based solution - literally just the cost of a function call (or not even that - i've written very fast inlined tracers - they do rock when it comes to performance). Problem solved and none of the INT3 details matters at all. However did I not think of that? Yes, and let's rip off kprobes tracing from the kernel, we can always rebuild it. Well, I'm observing an issue in a production system now. I may not want to take it down, or if I take it down I may not be able to observe it again as the problem takes a couple of days to show up, or I may not have the full source, or it takes 10 minutes to build and so an iterative edit/build/run cycle can stretch for hours. You have somewhat misconstrued my argument. What i said above is that _if_ you need extreme levels of performance you always have the option to go even faster via specialized tracing solutions. I did not promote it as a replacement solution. Specialization obviously brings in a new set of problems: inflexibility and non-transparency, an example of which you gave above. Your proposed solution brings in precisely such kinds of issues, on a different level, just to improve performance at the cost of transparency and at the cost of features and robustness. It's btw rather ironic as your arguments are somewhat similar to the Xen vs. KVM argument just turned around: KVM started out slower by relying on hardware implementation for virtualization while Xen relied on a clever but limiting hack. With each CPU generation the hardware got faster, while the various design limitations of Xen are hurting it and KVM is winning that race.
A (partially) similar situation exists here: INT3 into ring 0 and handling it there in a protected environment might be more expensive, but _if_ it matters to performance it sure could be made faster in hardware (and in fact it will become faster with every new generation of hardware). Both Peter and me are telling you that we are considering your solution too specialized, at the cost of flexibility, features and robustness. Thanks, Ingo
Re: [RFC] [PATCH 1/7] User Space Breakpoint Assistance Layer (UBP)
On Sun 2010-01-17 16:01:46, Peter Zijlstra wrote:
> On Sun, 2010-01-17 at 16:56 +0200, Avi Kivity wrote:
> > On 01/17/2010 04:52 PM, Peter Zijlstra wrote:
> > > Also, if it's fixed size you're imposing artificial limits on the number of possible probes.
> > Obviously we'll need a limit, a uprobe will also take kernel memory, we can't allow people to exhaust it.
> Only if it's unprivileged, kernel and root should be able to place as many probes as they like until the machine keels over.

Well, it is address space that limits you in both cases...

-- (english) http://www.livejournal.com/~pavelmachek (cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
Re: [RFC] [PATCH 1/7] User Space Breakpoint Assistance Layer (UBP)
On Mon, Jan 18, 2010 at 02:15:51PM +0100, Peter Zijlstra wrote: On Mon, 2010-01-18 at 14:37 +0200, Avi Kivity wrote: On 01/18/2010 02:14 PM, Peter Zijlstra wrote: Well, the alternatives are very unappealing. Emulation and single-stepping are going to be very slow compared to a couple of jumps. With CPL2 or RPL on user segments the protection issue seems to be manageable for running the instructions from kernel space. CPL2 gives unrestricted access to the kernel address space; and RPL does not affect page level protection. Segment limits don't work on x86-64. But perhaps I missed something - these things are tricky. So setting RPL to 3 on the user segments allows access to kernel pages just fine? How useful.. :/ It should be possible to translate the instruction into an address space check, followed by the action, but that's still slower due to privilege level switches. Well, if you manage to do the address validation you don't need the priv level switch anymore, right? It also starts becoming very x86-centric though, doesn't it? It might kick other ports later. What is there at the moment is storing the copied instructions in a VMA. The most unpalatable part of that to me is that it's visible to userspace, probably via /proc/ and I didn't check, but I hope an munmap() from userspace cannot delete it. What the VMA has going for it is that it *appears* to be easier to port to other architectures than the alternatives, certainly easier to handle than instruction emulation. Are the ins encodings sane enough to recognize mem parameters without needing to know the actual ins? How about using a hw-breakpoint to close the gap for the inline single step? You could even re-insert the int3 lazily when you need the hw-breakpoint again. It would consume one hw-breakpoint register for each task/cpu that has probes though.. This feels very racy. 
Along with that, making this sort of change was considered a risky venture on x86 and needed strong verification from elsewhere (http://lkml.org/lkml/2010/1/12/300). There are probably similar concerns on other architectures that would make a reliable port difficult.

Right now the approach is with VMAs. The alternatives are

1. reserved XOL page (similar disadvantages to the VMA)

2. emulated instructions
   This is an emulation bug waiting to happen in my opinion and makes porting uprobes a significantly more difficult undertaking than either the XOL-VMA or XOL-page approach

3. XOL page in kernel space available at a different CPL
   This assumes all target architectures have a usable privilege ring which may be the case. However, I would guess that it is going to perform worse than the current approach because of the change in privilege level. No idea what the cost of a privilege level change is, but I doubt it's free

4. Boosted probes (arch-specific, apparently only x86 does this for kprobes)

As unpalatable as the VMA is, I am failing to see why it's not a reasonable starting point with an understanding that 2 or 3 would be implemented in the future after the other architecture ports are in place and the reliability of the options as well as the performance can be measured.

There would appear to be two classes of application that might suffer from the VMA. The first needs absolutely every single ounce of address space. The second introspects itself via /proc/self/maps and makes decisions based on that. The first is unfortunate but should be a limited number of use cases. The second could be fudged by simply not exporting the information via /proc.

I'm of the opinion it would be reasonable to let the VMA go ahead, look at the ports for the other architectures and revisit options 2 and 3 above to see if the VMA can really be removed without a performance or reliability penalty.
-- Mel Gorman Part-time Phd Student Linux Technology Center University of Limerick IBM Dublin Software Lab
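For readers wondering what "introspects itself via /proc/self/maps" looks like in practice, the following is a rough sketch of the kind of parsing such an application might do. It only relies on the leading `start-end perms` fields of a maps line; the struct and function names are made up for illustration. Whether an injected XOL vma would actually be listed, and with which permissions, depends on the implementation and is an assumption here.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper types and names, for illustration only. */
struct map_range {
    unsigned long start;
    unsigned long end;
    char perms[5];      /* e.g. "r-xp" plus terminating NUL */
};

/* Parse the leading "start-end perms" fields of one maps line.
 * Returns 1 on success, 0 if the line does not match. */
int parse_maps_line(const char *line, struct map_range *out)
{
    assert(line != NULL && out != NULL);
    return sscanf(line, "%lx-%lx %4s", &out->start, &out->end, out->perms) == 3;
}

/* The third permission character is 'x' for executable mappings. */
int range_is_executable(const struct map_range *r)
{
    return r->perms[2] == 'x';
}
```

An application that walks its own maps with code like this and, say, assumes a fixed number of executable mappings is the second class of application described above: an extra mapping it did not create could change its decisions.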
Re: [RFC] [PATCH 1/7] User Space Breakpoint Assistance Layer (UBP)
On 01/19/2010 07:47 PM, Jim Keniston wrote: This is still with a kernel entry, yes? Yes, this involves setting a breakpoint and trapping into the kernel when it's hit. The 6-7x figure is with the current 2-trap approach (breakpoint, single-step). Boosting could presumably make that more like 12-14x. A trap is IIRC ~1000 cycles, we can reduce this to ~50 (totally negligible from the executed code's point of view). Do you have plans for a variant that's completely in userspace? I don't know of any such plans, but I'd be interested to read more of your thoughts here. As I understand it, you've suggested replacing the probed instruction with a jump into an instrumentation vma (the XOL area, or something similar). Masami has demonstrated -- through his djprobes enhancement to kprobes -- that this can be done for many x86 instructions. What does the code in the jumped-to vma do? 1. Write a trace entry into shared memory, trap into the kernel on overflow. 2. Trap if a condition is satisfied (fast watchpoint implementation). Is the instrumentation code that corresponds to the uprobe handlers encoded in an ad hoc .so? Looks like a good idea, but it doesn't matter much to me. -- error compiling committee.c: too many arguments to function
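Option 1 in the message above (write a trace entry into shared memory, trap into the kernel on overflow) can be sketched roughly as follows. This is a toy model under stated assumptions: the buffer is an ordinary struct rather than real shared memory, `trace_flush` merely stands in for the trap into the kernel, and all names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TRACE_SLOTS 4   /* tiny on purpose so overflow is easy to see */

/* Hypothetical layout; a real buffer would live in memory shared with the tracer. */
struct trace_buf {
    uint64_t entries[TRACE_SLOTS];
    size_t   used;      /* number of valid entries */
    unsigned flushes;   /* how many times the slow path ran */
};

/* Stand-in for "trap into the kernel": here it only empties the buffer. */
void trace_flush(struct trace_buf *b)
{
    b->used = 0;
    b->flushes++;
}

/* Fast path: store one entry; take the slow path only when the buffer is full. */
void trace_write(struct trace_buf *b, uint64_t probe_addr)
{
    assert(b != NULL);
    if (b->used == TRACE_SLOTS)
        trace_flush(b);
    b->entries[b->used++] = probe_addr;
}
```

The point of the design is that the expensive transition is paid once per `TRACE_SLOTS` entries instead of once per probe hit. The sketch also makes the objection raised later in the thread visible: the buffer lives in the traced process's own address space, so the process itself can overwrite it.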
Re: [RFC] [PATCH 1/7] User Space Breakpoint Assistance Layer (UBP)
On Wed, 2010-01-20 at 11:43 +0200, Avi Kivity wrote:
> 1. Write a trace entry into shared memory, trap into the kernel on overflow.
> 2. Trap if a condition is satisfied (fast watchpoint implementation).

So now you want to consume more of a process' address space to store trace data as well?

Not to mention that that process could wreck the trace data rendering it utterly unreliable.
Re: [RFC] [PATCH 1/7] User Space Breakpoint Assistance Layer (UBP)
On Wed, Jan 20, 2010 at 12:06:20PM +0530, Srikar Dronamraju wrote: * Frederic Weisbecker fweis...@gmail.com [2010-01-19 19:06:12]: On Tue, Jan 19, 2010 at 09:47:45AM -0800, Jim Keniston wrote: What does the code in the jumped-to vma do? Is the instrumentation code that corresponds to the uprobe handlers encoded in an ad hoc .so? Once the instrumentation is requested by a process that is not the instrumented one, this looks impossible to set a uprobe without a minimal voluntary collaboration from the instrumented process (events sent through IPC or whatever). So that looks too limited, this is not anymore a true dynamic uprobe. I dont see a case where the thread being debugged refuses to place a probe unless the process is exiting. The traced process doesnt decide if it wants to be probed or not. There could be a slight delay from the time the tracer requested to the time the probe is placed. But this delay is only affecting the tracer and the tracee. This is in contrast to say stop_machine where the threads of other applications are also affected. I did not think about a kind of trace point inserted in a shared memory. I was just confused :)
Re: [RFC] [PATCH 1/7] User Space Breakpoint Assistance Layer (UBP)
On 01/20/2010 11:57 AM, Peter Zijlstra wrote: On Wed, 2010-01-20 at 11:43 +0200, Avi Kivity wrote: 1. Write a trace entry into shared memory, trap into the kernel on overflow. 2. Trap if a condition is satisfied (fast watchpoint implementation). So now you want to consume more of a process' address space to store trace data as well? Yes. I know I'm bad. Not to mention that that process could wreck the trace data rendering it utterly unreliable. It could, but it also might not. Are we going to deny high performance tracing to users just because it doesn't work in all cases? Note this applies to any kind of monitoring or debugging technology. A process can be influenced by the debugger and render any debug info you get out of it unreliable. One non-timing example is a process using a checksum of its text as an input to some algorithm. -- error compiling committee.c: too many arguments to function
Re: [RFC] [PATCH 1/7] User Space Breakpoint Assistance Layer (UBP)
On 01/20/2010 12:45 PM, Srikar Dronamraju wrote: What does the code in the jumped-to vma do? 1. Write a trace entry into shared memory, trap into the kernel on overflow. 2. Trap if a condition is satisfied (fast watchpoint implementation). That looks to be a nice idea. We should certainly look into this possibility. However can we look at this option probably a little later? Our plan was to do one step at a time i.e have the basic uprobes in first and target the booster (i.e jump to the next instruction without the need for single-stepping next). We could look at this option of using jump instead of int3 after we are done with the booster. Hope that's okay. I'm all for incremental development and merging, as long as we keep the interfaces flexible enough for the future. -- error compiling committee.c: too many arguments to function
Re: [RFC] [PATCH 1/7] User Space Breakpoint Assistance Layer (UBP)
Peter Zijlstra pet...@infradead.org writes: With CPL2 or RPL on user segments the protection issue seems to be manageable for running the instructions from kernel space. Nope -- it doesn't work on 64bit and even on 32bit can have large costs on some CPUs. Also designing 32bit only features in 2010 would seem rather unfortunate. -Andi -- a...@linux.intel.com -- Speaking for myself only.
Re: [RFC] [PATCH 1/7] User Space Breakpoint Assistance Layer (UBP)
Frederic Weisbecker wrote: On Tue, Jan 19, 2010 at 09:47:45AM -0800, Jim Keniston wrote: Do you have plans for a variant that's completely in userspace? I don't know of any such plans, but I'd be interested to read more of your thoughts here. As I understand it, you've suggested replacing the probed instruction with a jump into an instrumentation vma (the XOL area, or something similar). Masami has demonstrated -- through his djprobes enhancement to kprobes -- that this can be done for many x86 instructions. What does the code in the jumped-to vma do? Is the instrumentation code that corresponds to the uprobe handlers encoded in an ad hoc .so? Once the instrumentation is requested by a process that is not the instrumented one, this looks impossible to set a uprobe without a minimal voluntary collaboration from the instrumented process (events sent through IPC or whatever). So that looks too limited, this is not anymore a true dynamic uprobe. Agreed. Since uprobe's handler must be running in kernel, we need to jump into kernel space anyway. Booster (just skips a single-stepping(trap) exception) may be useful for improving uprobe performance. And also as Andi said, using jump instead of int3 in userspace has 2GB address space limitation. It's not a problem for kernel inside, but a big problem in userspace. Thank you, -- Masami Hiramatsu Software Engineer Hitachi Computer Products (America), Inc. Software Solutions Division e-mail: mhira...@redhat.com
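The 2GB limitation mentioned above follows from the x86 near jump encoding: `jmp rel32` is 5 bytes (opcode 0xE9 plus a signed 32-bit displacement), and the displacement is measured from the end of the jump instruction. A small sketch of the reachability check, with a hypothetical function name, might look like this:

```c
#include <assert.h>
#include <stdint.h>

#define JMP_REL32_LEN 5   /* opcode 0xE9 + 4-byte signed displacement */

/* Returns 1 if a jmp rel32 placed at `from` can reach `to`, else 0.
 * Hypothetical helper for illustration only. */
int jmp_rel32_reachable(uint64_t from, uint64_t to)
{
    assert(from <= UINT64_MAX - JMP_REL32_LEN);  /* jump must not wrap the address space */
    uint64_t next = from + JMP_REL32_LEN;        /* displacement is relative to this */

    if (to >= next)
        return to - next <= (uint64_t)INT32_MAX;       /* forward: at most 2^31 - 1 */
    return next - to <= (uint64_t)INT32_MAX + 1;       /* backward: at most 2^31 */
}
```

In the kernel, text and probe buffers sit close enough together for this to hold. In a 64-bit user process the probed code and an instrumentation area can easily be much further apart than 2GB, for example an executable mapped low and a vma placed near the top of the address space, so a 5-byte jump cannot always be used as a drop-in replacement for the breakpoint.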
Re: [RFC] [PATCH 1/7] User Space Breakpoint Assistance Layer (UBP)
On 01/19/2010 12:15 AM, Jim Keniston wrote: I don't like the idea but if the performance benefits are real (are they?), Based on what seems to be the closest thing to an apples-to-apples comparison -- counting the number of calls to a specified function -- uprobes is 6-7 times faster than the ptrace-based equivalent, ltrace -c. And of course, uprobes provides much, much more flexibility, appears to scale better, and works with multithreaded apps. Likewise, FWIW, utrace is more than 10x faster than strace -c in counting system calls. This is still with a kernel entry, yes? Do you have plans for a variant that's completely in userspace? -- error compiling committee.c: too many arguments to function
Re: [RFC] [PATCH 1/7] User Space Breakpoint Assistance Layer (UBP)
On Tue, 2010-01-19 at 10:07 +0200, Avi Kivity wrote: On 01/19/2010 12:15 AM, Jim Keniston wrote: I don't like the idea but if the performance benefits are real (are they?), Based on what seems to be the closest thing to an apples-to-apples comparison -- counting the number of calls to a specified function -- uprobes is 6-7 times faster than the ptrace-based equivalent, ltrace -c. And of course, uprobes provides much, much more flexibility, appears to scale better, and works with multithreaded apps. Likewise, FWIW, utrace is more than 10x faster than strace -c in counting system calls. This is still with a kernel entry, yes? Yes, this involves setting a breakpoint and trapping into the kernel when it's hit. The 6-7x figure is with the current 2-trap approach (breakpoint, single-step). Boosting could presumably make that more like 12-14x. Do you have plans for a variant that's completely in userspace? I don't know of any such plans, but I'd be interested to read more of your thoughts here. As I understand it, you've suggested replacing the probed instruction with a jump into an instrumentation vma (the XOL area, or something similar). Masami has demonstrated -- through his djprobes enhancement to kprobes -- that this can be done for many x86 instructions. What does the code in the jumped-to vma do? Is the instrumentation code that corresponds to the uprobe handlers encoded in an ad hoc .so? BTW, when some people say completely in userspace, they mean something like ptrace, where the kernel is still heavily involved but the instrumentation code runs in user space. The ubp layer is intended to support that model as well. In our various implementations of the XOL vma/address area, however, the XOL area is either created on exec or created/expanded only by the probed process. Jim
Re: [RFC] [PATCH 1/7] User Space Breakpoint Assistance Layer (UBP)
On Tue, Jan 19, 2010 at 09:47:45AM -0800, Jim Keniston wrote: Do you have plans for a variant that's completely in userspace? I don't know of any such plans, but I'd be interested to read more of your thoughts here. As I understand it, you've suggested replacing the probed instruction with a jump into an instrumentation vma (the XOL area, or something similar). Masami has demonstrated -- through his djprobes enhancement to kprobes -- that this can be done for many x86 instructions. What does the code in the jumped-to vma do? Is the instrumentation code that corresponds to the uprobe handlers encoded in an ad hoc .so? Once the instrumentation is requested by a process that is not the instrumented one, this looks impossible to set a uprobe without a minimal voluntary collaboration from the instrumented process (events sent through IPC or whatever). So that looks too limited, this is not anymore a true dynamic uprobe.
Re: [RFC] [PATCH 1/7] User Space Breakpoint Assistance Layer (UBP)
* Frederic Weisbecker fweis...@gmail.com [2010-01-19 19:06:12]: On Tue, Jan 19, 2010 at 09:47:45AM -0800, Jim Keniston wrote: What does the code in the jumped-to vma do? Is the instrumentation code that corresponds to the uprobe handlers encoded in an ad hoc .so? Once the instrumentation is requested by a process that is not the instrumented one, this looks impossible to set a uprobe without a minimal voluntary collaboration from the instrumented process (events sent through IPC or whatever). So that looks too limited, this is not anymore a true dynamic uprobe. I dont see a case where the thread being debugged refuses to place a probe unless the process is exiting. The traced process doesnt decide if it wants to be probed or not. There could be a slight delay from the time the tracer requested to the time the probe is placed. But this delay is only affecting the tracer and the tracee. This is in contrast to say stop_machine where the threads of other applications are also affected.
Re: [RFC] [PATCH 1/7] User Space Breakpoint Assistance Layer (UBP)
On 01/18/2010 09:45 AM, Peter Zijlstra wrote:
> > This is debugging. We're playing with registers, we're playing with the cpu, we're playing with memory contents. Why not the address space as well?
> Because you want things to be as transparent as possible in order to avoid heisenbugs. Sure we cannot avoid everything, but we should avoid everything we possibly can.

If we reserve some address space, you don't add any heisenbugs (at least, not any additional ones over emulation). Even if we don't, address space layout randomization means we're not keeping the address space layout constant between runs anyway.

> Also, aside of the VDSO, we simply do not force map things into address spaces (and like said before, I think the VDSO stinks for doing that) and I think we don't want to create (more) precedents in this case.

You've made it clear that you don't like it, but not why. The kernel already manages the user's address space (except for MAP_FIXED which is unreliable unless you've already reserved the address space). I don't see why adding a vma for debugging is so horrible.

-- error compiling committee.c: too many arguments to function
Re: [RFC] [PATCH 1/7] User Space Breakpoint Assistance Layer (UBP)
On Mon, 2010-01-18 at 13:01 +0200, Avi Kivity wrote:
> You've made it clear that you don't like it, but not why. The kernel already manages the user's address space (except for MAP_FIXED which is unreliable unless you've already reserved the address space). I don't see why adding a vma for debugging is so horrible.

Well, the kernel only does what the user (and loader) tell it through mmap(). Other than that we never (except this VDSO thing) inject vmas, and I see no reason to start doing that now.
Re: [RFC] [PATCH 1/7] User Space Breakpoint Assistance Layer (UBP)
On Mon, 2010-01-18 at 13:01 +0200, Avi Kivity wrote:
> If we reserve some address space, you don't add any heisenbugs (at least, not any additional ones over emulation). Even if we don't, address space layout randomization means we're not keeping the address space layout constant between runs anyway.

Well, it still limits the number of probes to the reserved area. If you want more you need to grow the area.. which then changes the state.
Re: [RFC] [PATCH 1/7] User Space Breakpoint Assistance Layer (UBP)
On 01/18/2010 01:44 PM, Peter Zijlstra wrote: On Mon, 2010-01-18 at 13:01 +0200, Avi Kivity wrote: You've made it clear that you don't like it, but not why. The kernel already manages the user's address space (except for MAP_FIXED which is unreliable unless you've already reserved the address space). I don't see why adding a vma for debugging is so horrible. Well, the kernel only does what the user (and loader) tell it through mmap(). What I meant was that the kernel chooses the addresses (unless you go the MAP_FIXED way). From the user's point of view, there is no change in behaviour: the kernel picks an address. If the constraints have changed (because we reserve a range), that doesn't affect the user. Other than that we never (except this VDSO thing) inject vmas, and I see no reason to start doing that now. Maybe you place no value on uprobes. But people who debug userspace likely will see a reason. -- error compiling committee.c: too many arguments to function
Re: [RFC] [PATCH 1/7] User Space Breakpoint Assistance Layer (UBP)
On Mon, 2010-01-18 at 14:01 +0200, Avi Kivity wrote:
> Maybe you place no value on uprobes. But people who debug userspace likely will see a reason.

I do see value in uprobes, I just don't like it mucking about with the address space. Nor does it appear required.
Re: [RFC] [PATCH 1/7] User Space Breakpoint Assistance Layer (UBP)
On 01/18/2010 02:06 PM, Peter Zijlstra wrote:
> On Mon, 2010-01-18 at 14:01 +0200, Avi Kivity wrote:
> > Maybe you place no value on uprobes. But people who debug userspace likely will see a reason.
> I do see value in uprobes, I just don't like it mucking about with the address space. Nor does it appear required.

Well, the alternatives are very unappealing. Emulation and single-stepping are going to be very slow compared to a couple of jumps.

-- error compiling committee.c: too many arguments to function
Re: [RFC] [PATCH 1/7] User Space Breakpoint Assistance Layer (UBP)
Hi Avi, On Mon, 2010-01-18 at 14:01 +0200, Avi Kivity wrote: Maybe you place no value on uprobes. But people who debug userspace likely will see a reason. On 01/18/2010 02:06 PM, Peter Zijlstra wrote: I do see value in uprobes, I just don't like it mucking about with the address space. Nor does it appear required. On Mon, Jan 18, 2010 at 2:09 PM, Avi Kivity a...@redhat.com wrote: Well, the alternatives are very unappealing. Emulation and single-stepping are going to be very slow compared to a couple of jumps. So how big chunks of the address space are we talking here for uprobes?
Re: [RFC] [PATCH 1/7] User Space Breakpoint Assistance Layer (UBP)
On 01/18/2010 02:13 PM, Pekka Enberg wrote: So how big chunks of the address space are we talking here for uprobes? That's for the authors to answer, but at a guess, 32 bytes per probe (largest x86 instruction is 15 bytes), so 32 MB will give you a million probes. That's a piece of cake for x86-64, probably harder to justify for i386. -- error compiling committee.c: too many arguments to function
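The estimate above can be checked with a few lines of arithmetic. This is only a sketch: the 4 KiB page size and the helper names `xol_bytes` and `xol_pages` are assumptions made for illustration, and the 32-byte slot size is the guess quoted in the message, not a measured figure.

```python
# Back-of-the-envelope sizing for an XOL area, one slot per probe.
PAGE_SIZE = 4096          # assumption: 4 KiB pages
MiB = 1024 * 1024


def xol_bytes(num_probes, slot_size):
    """Address space consumed if every probe gets its own slot."""
    return num_probes * slot_size


def xol_pages(num_probes, slot_size, page_size=PAGE_SIZE):
    """The same figure rounded up to whole pages."""
    slots_per_page = page_size // slot_size
    return -(-num_probes // slots_per_page)   # ceiling division


if __name__ == "__main__":
    # 2**20 probes at 32 bytes each is exactly 32 MiB.
    print(xol_bytes(2**20, 32) // MiB, "MiB")
    # A decimal million probes needs slightly fewer pages.
    print(xol_pages(1_000_000, 32), "pages")
```

Passing a different `slot_size` shows how directly the slot size drives the footprint: halving it halves the address space needed for the same number of probes.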
Re: [RFC] [PATCH 1/7] User Space Breakpoint Assistance Layer (UBP)
On Mon, 2010-01-18 at 14:17 +0200, Avi Kivity wrote: On 01/18/2010 02:13 PM, Pekka Enberg wrote: So how big chunks of the address space are we talking here for uprobes? That's for the authors to answer, but at a guess, 32 bytes per probe (largest x86 instruction is 15 bytes), so 32 MB will give you a million probes. That's a piece of cake for x86-64, probably harder to justify for i386. Yeah, I'm aware of people turning off address space randomization to gain more virtual space on i386, I'm pretty sure those folks aren't going to be happy if we shrink it. Let alone them trying to probe their app.
Re: [RFC] [PATCH 1/7] User Space Breakpoint Assistance Layer (UBP)
* Avi Kivity a...@redhat.com [2010-01-18 14:17:10]: On 01/18/2010 02:13 PM, Pekka Enberg wrote: So how big chunks of the address space are we talking here for uprobes? That's for the authors to answer, but at a guess, 32 bytes per probe (largest x86 instruction is 15 bytes), so 32 MB will give you a million probes. That's a piece of cake for x86-64, probably harder to justify for i386. On x86, each probe takes 16 bytes. In the current implementation of XOL, the first hit of a breakpoint requires us to allocate a page. If that page fills up with active breakpoints, we add another page. A bitmap tracks which slots are in use, so that when a breakpoint is removed its slot can be reused. By active breakpoints, I mean those that have been inserted and trapped at least once but not yet removed. Jim did try a few other allocation techniques, but those that involved slot stealing ended up needing locking. People who looked at that code advised us to reduce the locking and keep the allocation simple (at least for the first cut). -- Thanks and Regards Srikar -- error compiling committee.c: too many arguments to function
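The allocation scheme Srikar describes (allocate a page on first hit, add a page when full, reuse freed slots via a bitmap) might look roughly like the toy model below. It is not the uprobes code: the class name `XolArea`, the 4 KiB page size, and the use of a Python integer as the bitmap are all assumptions for illustration, and there is no locking.

```python
PAGE_SIZE = 4096                          # assumption: 4 KiB pages
SLOT_SIZE = 16                            # per-probe slot size on x86, per the message
SLOTS_PER_PAGE = PAGE_SIZE // SLOT_SIZE   # 256 slots per page
FULL = (1 << SLOTS_PER_PAGE) - 1          # bitmap value with every slot in use


class XolArea:
    """Toy model: a growable list of pages, each with a bitmap of used slots."""

    def __init__(self):
        self.bitmaps = []                 # one integer bitmap per page

    def alloc_slot(self):
        """Return (page, slot), reusing a free slot before adding a page."""
        for page, bitmap in enumerate(self.bitmaps):
            if bitmap == FULL:
                continue
            for bit in range(SLOTS_PER_PAGE):
                if not bitmap & (1 << bit):
                    self.bitmaps[page] = bitmap | (1 << bit)
                    return page, bit
        # No page yet, or every page is full: add one and take its first slot.
        self.bitmaps.append(1)
        return len(self.bitmaps) - 1, 0

    def free_slot(self, page, bit):
        """Clear the bit so the slot can be handed out again."""
        self.bitmaps[page] &= ~(1 << bit)
```

In this model the first `alloc_slot()` call creates page 0, the 257th call creates page 1, and a slot released with `free_slot()` is the next one handed out.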
Re: [RFC] [PATCH 1/7] User Space Breakpoint Assistance Layer (UBP)
On Mon, Jan 18, 2010 at 2:44 PM, Srikar Dronamraju sri...@linux.vnet.ibm.com wrote: * Avi Kivity a...@redhat.com [2010-01-18 14:17:10]: On 01/18/2010 02:13 PM, Pekka Enberg wrote: So how big chunks of the address space are we talking here for uprobes? That's for the authors to answer, but at a guess, 32 bytes per probe (largest x86 instruction is 15 bytes), so 32 MB will give you a million probes. That's a piece of cake for x86-64, probably harder to justify for i386. On x86, each probe takes 16 bytes. And how many probes do we expect to be live at the same time in real-world scenarios? I guess Avi's one million is more than enough?
Re: [RFC] [PATCH 1/7] User Space Breakpoint Assistance Layer (UBP)
On 01/18/2010 02:51 PM, Pekka Enberg wrote: And how many probes do we expect to be live at the same time in real-world scenarios? I guess Avi's one million is more than enough? I don't think a user will ever come close to a million, but we can expect some inflation from inlined functions (I don't know if uprobes replicates such probes, but if it doesn't, it should). -- error compiling committee.c: too many arguments to function
Re: [RFC] [PATCH 1/7] User Space Breakpoint Assistance Layer (UBP)
On 01/18/2010 02:51 PM, Pekka Enberg wrote: And how many probes do we expect to be live at the same time in real-world scenarios? I guess Avi's one million is more than enough? Avi Kivity wrote: I don't think a user will ever come close to a million, but we can expect some inflation from inlined functions (I don't know if uprobes replicates such probes, but if it doesn't, it should). Right. I guess we're looking at a few megabytes of the address space for normal scenarios, which doesn't seem too excessive. However, as Peter pointed out, the bigger problem is that now we're opening the door for other features to steal chunks of the address space. And I think it's a legitimate worry that it's going to cause problems for 32-bit in the future. I don't like the idea but if the performance benefits are real (are they?), maybe it's a worthwhile trade-off. Dunno. Pekka
Re: [RFC] [PATCH 1/7] User Space Breakpoint Assistance Layer (UBP)
On 01/18/2010 02:57 PM, Pekka Enberg wrote: On 01/18/2010 02:51 PM, Pekka Enberg wrote: And how many probes do we expect to be live at the same time in real-world scenarios? I guess Avi's one million is more than enough? Avi Kivity wrote: I don't think a user will ever come close to a million, but we can expect some inflation from inlined functions (I don't know if uprobes replicates such probes, but if it doesn't, it should). Right. I guess we're looking at a few megabytes of the address space for normal scenarios, which doesn't seem too excessive. However, as Peter pointed out, the bigger problem is that now we're opening the door for other features to steal chunks of the address space. And I think it's a legitimate worry that it's going to cause problems for 32-bit in the future. I don't like the idea but if the performance benefits are real (are they?), maybe it's a worthwhile trade-off. Dunno. If uprobes can trace to buffer memory in the process address space, I think the win can be dramatic. Incidentally it will require injecting even more vmas into a process. Basically it means very low cost tracing, like the kernel tracers. -- error compiling committee.c: too many arguments to function
Re: [RFC] [PATCH 1/7] User Space Breakpoint Assistance Layer (UBP)
On Mon, 2010-01-18 at 14:37 +0200, Avi Kivity wrote: On 01/18/2010 02:14 PM, Peter Zijlstra wrote: Well, the alternatives are very unappealing. Emulation and single-stepping are going to be very slow compared to a couple of jumps. With CPL2 or RPL on user segments the protection issue seems to be manageable for running the instructions from kernel space. CPL2 gives unrestricted access to the kernel address space; and RPL does not affect page level protection. Segment limits don't work on x86-64. But perhaps I missed something - these things are tricky. So setting RPL to 3 on the user segments allows access to kernel pages just fine? How useful.. :/ It should be possible to translate the instruction into an address space check, followed by the action, but that's still slower due to privilege level switches. Well, if you manage to do the address validation you don't need the priv level switch anymore, right? Are the ins encodings sane enough to recognize mem parameters without needing to know the actual ins? How about using a hw-breakpoint to close the gap for the inline single step? You could even re-insert the int3 lazily when you need the hw-breakpoint again. It would consume one hw-breakpoint register for each task/cpu that has probes though..
Re: [RFC] [PATCH 1/7] User Space Breakpoint Assistance Layer (UBP)
On 01/18/2010 03:15 PM, Peter Zijlstra wrote: On Mon, 2010-01-18 at 14:37 +0200, Avi Kivity wrote: On 01/18/2010 02:14 PM, Peter Zijlstra wrote: Well, the alternatives are very unappealing. Emulation and single-stepping are going to be very slow compared to a couple of jumps. With CPL2 or RPL on user segments the protection issue seems to be manageable for running the instructions from kernel space. CPL2 gives unrestricted access to the kernel address space; and RPL does not affect page level protection. Segment limits don't work on x86-64. But perhaps I missed something - these things are tricky. So setting RPL to 3 on the user segments allows access to kernel pages just fine? How useful.. :/ The further we stay away from segmentation, the better. Thankfully AMD removed hardware task switching from x86-64 so we can't even think about that. It should be possible to translate the instruction into an address space check, followed by the action, but that's still slower due to privilege level switches. Well, if you manage to do the address validation you don't need the priv level switch anymore, right? Right. Are the ins encodings sane enough to recognize mem parameters without needing to know the actual ins? No. You need to know whether the instruction accesses memory or not. Look at the tables at the beginning of arch/x86/kvm/emulate.c. Opcodes marked with ModRM, BitOp, MemAbs, String, Stack are all different styles of memory instructions. You need to know the operand size for the edge cases. And there are probably a few special cases in the code. How about using a hw-breakpoint to close the gap for the inline single step? You could even re-insert the int3 lazily when you need the hw-breakpoint again. It would consume one hw-breakpoint register for each task/cpu that has probes though.. If you have more than four threads, it breaks, no? And you need an IPI each time you hit the breakpoint. 
Ultimately I'd like to see the breakpoint avoided as well, use a jump to the XOL area and trace in ~20 cycles instead of ~1000. -- error compiling committee.c: too many arguments to function
Re: [RFC] [PATCH 1/7] User Space Breakpoint Assistance Layer (UBP)
On Mon, 2010-01-18 at 14:53 +0200, Avi Kivity wrote: On 01/18/2010 02:51 PM, Pekka Enberg wrote: And how many probes do we expected to be live at the same time in real-world scenarios? I guess Avi's one million is more than enough? I don't think a user will ever come close to a million, but we can expect some inflation from inlined functions (I don't know if uprobes replicates such probes, but if it doesn't, it should). SystemTap by default places probes on all instances of an inlined function. It is still hard to get to a million probes though. $ stap -v -l 'process(/usr/bin/emacs).function(*)' [...] Pass 2: analyzed script: 4359 probe(s) You can try probing all statements (for every function, in every file, on every line of source code), but even that only adds up to tens of thousands of probes: $ stap -v -l 'process(/usr/bin/emacs).statement(*...@*:*)' [...] Pass 2: analyzed script: 39603 probe(s) So a million is pretty far out, even if you add larger programs and all the shared libraries they are using. As Srikar said, the current allocation technique is the simplest you can do: one xol slot for each uprobe. But there are other techniques that you can use. Theoretically you only need a xol slot for each thread of a process that simultaneously hits a uprobe instance. That requires a bit more bookkeeping. The variant of uprobes that systemtap uses at the moment does that. But the locking in that case is pretty tricky, so it seemed easier to first get the code with the simplest xol allocation technique upstream. But if you do that, then you can use a very small xol area to support millions of uprobes and only have to expand it when there are hundreds of threads in a process all hitting the probes simultaneously. Cheers, Mark
Re: [RFC] [PATCH 1/7] User Space Breakpoint Assistance Layer (UBP)
On Mon, Jan 18, 2010 at 02:15:51PM +0100, Peter Zijlstra wrote: On Mon, 2010-01-18 at 14:37 +0200, Avi Kivity wrote: On 01/18/2010 02:14 PM, Peter Zijlstra wrote: Well, the alternatives are very unappealing. Emulation and single-stepping are going to be very slow compared to a couple of jumps. With CPL2 or RPL on user segments the protection issue seems to be manageable for running the instructions from kernel space. CPL2 gives unrestricted access to the kernel address space; and RPL does not affect page level protection. Segment limits don't work on x86-64. But perhaps I missed something - these things are tricky. So setting RPL to 3 on the user segments allows access to kernel pages just fine? How useful.. :/ It should be possible to translate the instruction into an address space check, followed by the action, but that's still slower due to privilege level switches. Well, if you manage to do the address validation you don't need the priv level switch anymore, right? Are the ins encodings sane enough to recognize mem parameters without needing to know the actual ins? How about using a hw-breakpoint to close the gap for the inline single step? You could even re-insert the int3 lazily when you need the hw-breakpoint again. It would consume one hw-breakpoint register for each task/cpu that has probes though.. A very scarce resource that it is, well, sometimes all that we might have is just one hw-breakpoint register (like older PPC64 with 1 IABR) in the system. If one process/thread consumes it, then all other contenders (from both kernel and user-space) are prevented from acquiring it. Also to mention the existence of processors with no support for instruction breakpoints. Thanks, K.Prasad
Re: [RFC] [PATCH 1/7] User Space Breakpoint Assistance Layer (UBP)
On Mon, Jan 18, 2010 at 02:13:25PM +0200, Pekka Enberg wrote: Hi Avi, On Mon, 2010-01-18 at 14:01 +0200, Avi Kivity wrote: Maybe you place no value on uprobes. But people who debug userspace likely will see a reason. On 01/18/2010 02:06 PM, Peter Zijlstra wrote: I do see value in uprobes, I just don't like it mucking about with the address space. Nor does it appear required. On Mon, Jan 18, 2010 at 2:09 PM, Avi Kivity a...@redhat.com wrote: Well, the alternatives are very unappealing. Emulation and single-stepping are going to be very slow compared to a couple of jumps. So how big chunks of the address space are we talking here for uprobes? As Srikar mentioned, the least we start with is 1 page. Though you can have as many probes as you want, there are certain optimizations we can do, depending on the most common usecases. For eg., if you'd consider the start of a routine to be the most commonly traced location, most routines in a binary would generally start with the same instruction (say push %ebp), and we can refcount a slot with that instruction to be used for all probes of the same instruction. Ananth
Re: [RFC] [PATCH 1/7] User Space Breakpoint Assistance Layer (UBP)
Jim Keniston wrote: Not really. For #3 (boosting), you need to know everything for #2, plus be able to compute the length of each instruction -- which we can now do for x86. To emulate an instruction (#4), you need to replicate what it does, side-effects and all. The x86 instruction set seems to be adding new floating-point instructions all the time, and I bet even Masami doesn't know what they all do, but so far, they all seem to adhere to the instruction-length rules encoded in Masami's instruction decoder. Actually, the current x86 decoder doesn't support FP (x87) instructions (even though it already supports AVX). But I think it wouldn't be hard to add. Thank you, -- Masami Hiramatsu Software Engineer Hitachi Computer Products (America), Inc. Software Solutions Division e-mail: mhira...@redhat.com
Re: [RFC] [PATCH 1/7] User Space Breakpoint Assistance Layer (UBP)
On 01/18/2010 05:43 PM, Ananth N Mavinakayanahalli wrote: Well, the alternatives are very unappealing. Emulation and single-stepping are going to be very slow compared to a couple of jumps. So how big chunks of the address space are we talking here for uprobes? As Srikar mentioned, the least we start with is 1 page. Though you can have as many probes as you want, there are certain optimizations we can do, depending on the most common usecases. For eg., if you'd consider the start of a routine to be the most commonly traced location, most routines in a binary would generally start with the same instruction (say push %ebp), and we can refcount a slot with that instruction to be used for all probes of the same instruction. But then you can't follow the instruction with a jump back to the code... -- error compiling committee.c: too many arguments to function
Re: [RFC] [PATCH 1/7] User Space Breakpoint Assistance Layer (UBP)
On Mon, Jan 18, 2010 at 06:52:32PM +0200, Avi Kivity wrote: On 01/18/2010 05:43 PM, Ananth N Mavinakayanahalli wrote: Well, the alternatives are very unappealing. Emulation and single-stepping are going to be very slow compared to a couple of jumps. So how big chunks of the address space are we talking here for uprobes? As Srikar mentioned, the least we start with is 1 page. Though you can have as many probes as you want, there are certain optimizations we can do, depending on the most common usecases. For eg., if you'd consider the start of a routine to be the most commonly traced location, most routines in a binary would generally start with the same instruction (say push %ebp), and we can refcount a slot with that instruction to be used for all probes of the same instruction. But then you can't follow the instruction with a jump back to the code... Right. This will work only for the non boosted case where single-stepping is mandatory. I guess the tradeoff is vma space and speed. Ananth
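The refcounting idea Ananth describes, together with the limitation Avi points out, can be sketched as follows. Everything here is hypothetical: the class `SharedSlots`, its `get`/`put` methods, and keying slots by raw instruction bytes are illustrative choices, not the uprobes interface. The byte `0x55` is the x86 encoding of `push %ebp`.

```python
class SharedSlots:
    """Toy model: one XOL slot per distinct instruction, shared by refcount.

    A shared slot cannot end with a jump back to one particular probe site,
    so, as noted in the thread, this only suits the single-stepped case.
    """

    def __init__(self):
        self.slots = {}        # instruction bytes -> [slot index, refcount]
        self.next_slot = 0     # simplification: freed indices are not recycled

    def get(self, insn):
        """Take a reference on the slot holding `insn`, creating it if needed."""
        entry = self.slots.get(insn)
        if entry is None:
            entry = [self.next_slot, 0]
            self.slots[insn] = entry
            self.next_slot += 1
        entry[1] += 1
        return entry[0]

    def put(self, insn):
        """Drop a reference; release the slot when the last probe goes away."""
        entry = self.slots[insn]
        entry[1] -= 1
        if entry[1] == 0:
            del self.slots[insn]


PUSH_EBP = b"\x55"             # push %ebp
MOV_ESP_EBP = b"\x89\xe5"      # mov %esp,%ebp
```

With this model, any number of probes placed on function entries that begin with `push %ebp` consume a single slot, which is the vma-space side of the tradeoff Ananth mentions; the speed side is lost because the slot cannot be boosted.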
Re: [RFC] [PATCH 1/7] User Space Breakpoint Assistance Layer (UBP)
On Mon, 2010-01-18 at 10:58 -0500, Masami Hiramatsu wrote: Jim Keniston wrote: Not really. For #3 (boosting), you need to know everything for #2, plus be able to compute the length of each instruction -- which we can now do for x86. To emulate an instruction (#4), you need to replicate what it does, side-effects and all. The x86 instruction set seems to be adding new floating-point instructions all the time, and I bet even Masami doesn't know what they all do, but so far, they all seem to adhere to the instruction-length rules encoded in Masami's instruction decoder. Actually, current x86 decoder doesn't support FP(x87) instructions.(even it already supported AVX) But I think it's not so hard to add it. At one point I verified that it worked for all the x87 instructions in libm: https://www.redhat.com/archives/utrace-devel/2009-March/msg00031.html I'm pretty sure I tested mmx instructions as well. But I guess this was before you rearranged the opcode tables. Yeah, it wouldn't be hard to add back in, at least for purposes of computing instruction lengths. Jim
Re: [RFC] [PATCH 1/7] User Space Breakpoint Assistance Layer (UBP)
On Mon, 2010-01-18 at 14:34 +0100, Mark Wielaard wrote: On Mon, 2010-01-18 at 14:53 +0200, Avi Kivity wrote: On 01/18/2010 02:51 PM, Pekka Enberg wrote: And how many probes do we expected to be live at the same time in real-world scenarios? I guess Avi's one million is more than enough? I don't think a user will ever come close to a million, but we can expect some inflation from inlined functions (I don't know if uprobes replicates such probes, but if it doesn't, it should). SystemTap by default places probes on all instances of an inlined function. It is still hard to get to a million probes though. $ stap -v -l 'process(/usr/bin/emacs).function(*)' [...] Pass 2: analyzed script: 4359 probe(s) You can try probing all statements (for every function, in every file, on every line of source code), but even that only adds up to ten thousands of probes: $ stap -v -l 'process(/usr/bin/emacs).statement(*...@*:*)' [...] Pass 2: analyzed script: 39603 probe(s) So a million is pretty far out, even if you add larger programs and all the shared libraries they are using. Thanks, Mark. One correction, below. As Srikar said the current allocation technique is the simplest you can do, one xol slot for each uprobe. But there are other techniques that you can use. Theoretically you only need a xol slot for each thread of a process that simultaneously hits a uprobe instance. That requires a bit more bookkeeping. The variant of uprobes that systemtap uses at the moment does that. Actually, it's per-probepoint, with a fixed number of slots. If the probepoint you just hit doesn't have a slot, and none are free, you steal a slot from another probepoint. Yeah, it's messy. We considered allocating slots per-thread, hoping to make it basically lockless, but that way there's more likely to be constant scribbling on the XOL area, as a thread with n slots cycles through n+m probepoints. And of course, it gets dicey as the process clones more threads. 
I guess the point is, there are a lot of ways to allocate slots, and we haven't found the perfect algorithm yet, even if you accept the existence of (and need for) the XOL area. Keep the ideas coming. But the locking in that case is pretty tricky, so it seemed easier to first get the code with the simplest xol allocation technique upstream. But if you do that, then you can use a very small xol area to support millions of uprobes and only have to expand it when there are hundreds of threads in a process all hitting the probes simultaneously. Cheers, Mark Jim
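A minimal model of the per-probepoint scheme Jim describes (a fixed number of slots, with stealing when none are free) is sketched below. The class `StealingSlots` and the round-robin choice of victim are assumptions; the message does not say how a victim is picked. The sketch also omits the locking that the real thing needs, since a thread may still be single-stepping in the slot being stolen.

```python
class StealingSlots:
    """Toy model: a fixed pool of XOL slots owned per probepoint."""

    def __init__(self, nslots):
        self.owner = [None] * nslots   # slot index -> probepoint address
        self.slot_of = {}              # probepoint address -> slot index
        self.victim = 0                # assumption: round-robin stealing

    def slot_for(self, probepoint):
        """Return the slot for a probepoint that was just hit."""
        if probepoint in self.slot_of:
            return self.slot_of[probepoint]
        for index, owner in enumerate(self.owner):
            if owner is None:          # a free slot exists, no stealing needed
                break
        else:
            # Nothing free: steal the next victim's slot.
            index = self.victim
            self.victim = (self.victim + 1) % len(self.owner)
            del self.slot_of[self.owner[index]]
        self.owner[index] = probepoint
        self.slot_of[probepoint] = index
        return index
```

With two slots and three probepoints hit in rotation, every hit after the first two steals a slot, which is the "constant scribbling on the XOL area" that the message worries about.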
Re: [RFC] [PATCH 1/7] User Space Breakpoint Assistance Layer (UBP)
Jim Keniston wrote: On Mon, 2010-01-18 at 10:58 -0500, Masami Hiramatsu wrote: Jim Keniston wrote: Not really. For #3 (boosting), you need to know everything for #2, plus be able to compute the length of each instruction -- which we can now do for x86. To emulate an instruction (#4), you need to replicate what it does, side-effects and all. The x86 instruction set seems to be adding new floating-point instructions all the time, and I bet even Masami doesn't know what they all do, but so far, they all seem to adhere to the instruction-length rules encoded in Masami's instruction decoder. Actually, current x86 decoder doesn't support FP(x87) instructions.(even it already supported AVX) But I think it's not so hard to add it. At one point I verified that it worked for all the x87 instructions in libm: https://www.redhat.com/archives/utrace-devel/2009-March/msg00031.html I'm pretty sure I tested mmx instructions as well. But I guess this was before you rearranged the opcode tables. Yeah, it wouldn't be hard to add back in, at least for purposes of computing instruction lengths. objdump -d /lib/libm.so.6 | awk -f arch/x86/tools/distill.awk | ./test_get_len Succeed: decoded and checked 37198 instructions Hmm, yeah, that's already supported :-D. Thank you, -- Masami Hiramatsu Software Engineer Hitachi Computer Products (America), Inc. Software Solutions Division e-mail: mhira...@redhat.com
Re: [RFC] [PATCH 1/7] User Space Breakpoint Assistance Layer (UBP)
On 01/16/2010 02:58 AM, Jim Keniston wrote: I hear (er, read) you. Emulation may turn out to be the answer for some architectures. But here are some things to keep in mind about the various approaches:
1. Single-stepping inline is easiest: you need to know very little about the instruction set you're probing. But it's inadequate for multithreaded apps.
2. Single-stepping out of line solves the multithreading issue (as do #3 and #4), but requires more knowledge of the instruction set. (In particular, calls, jumps, and returns need special care; as do rip-relative instructions in x86_64.) I count 9 architectures that support kprobes. I think most of these do SSOL.
3. Boosted probes (where an appended jump instruction removes the need for the single-step trap on many instructions) require even more knowledge of the instruction set, and like SSOL, require XOL slots. Right now, as far as I know, x86 is the only architecture with boosted kprobes.
4. Emulation removes the need for the XOL area, but requires pretty much total knowledge of the instruction set. It's also a performance win for architectures that can't do #3. I see kvm implemented on 4 architectures (ia64, powerpc, s390, x86). Coincidentally, those are the architectures to which uprobes (old uprobes, with ubp and xol bundled in) has already been ported (though Intel hasn't been maintaining their ia64 port).
So it sort of comes down to how objectionable the XOL vma (or page) really is.
The kvm emulator emulates only a subset of the x86 instruction set (basically mmio instructions and commonly-used page-table manipulation instructions, as well as some privileged instructions). It would take a lot of work to expand it to be completely generic; and even then it will fail if userspace uses an instruction set extension the kernel is not aware of. To me, boosted probes with a fallback to single-stepping seems to be the better option by far. -- error compiling committee.c: too many arguments to function
Re: [RFC] [PATCH 1/7] User Space Breakpoint Assistance Layer (UBP)
On Sun, 2010-01-17 at 16:56 +0200, Avi Kivity wrote: On 01/17/2010 04:52 PM, Peter Zijlstra wrote: Also, if it's fixed size you're imposing artificial limits on the number of possible probes. Obviously we'll need a limit, a uprobe will also take kernel memory, we can't allow people to exhaust it. Only if it's unprivileged; kernel and root should be able to place as many probes as they like, until the machine keels over.
Re: [RFC] [PATCH 1/7] User Space Breakpoint Assistance Layer (UBP)
On Sun, 2010-01-17 at 16:59 +0200, Avi Kivity wrote: On 01/17/2010 04:52 PM, Peter Zijlstra wrote: On Sun, 2010-01-17 at 16:39 +0200, Avi Kivity wrote: On 01/15/2010 11:50 AM, Peter Zijlstra wrote: As previously stated, I think poking at a process's address space is an utter no-go. Why not reserve an address space range for this, somewhere near the top of memory? It doesn't have to be populated if it isn't used. Because I think poking at a process's address space like that is gross. Also, if its fixed size you're imposing artificial limits on the number of possible probes. btw, an alternative is to require the caller to provide the address space for this. If the caller is in another process, we need to allow it to play with the target's address space (i.e. mmap_process()). I don't think uprobes justifies this by itself, but mmap_process() can be very useful for sandboxing with seccomp. mmap_process() sounds utterly gross, one process playing with another process's address space.. yuck!
Re: [RFC] [PATCH 1/7] User Space Breakpoint Assistance Layer (UBP)
On 01/17/2010 05:03 PM, Peter Zijlstra wrote: btw, an alternative is to require the caller to provide the address space for this. If the caller is in another process, we need to allow it to play with the target's address space (i.e. mmap_process()). I don't think uprobes justifies this by itself, but mmap_process() can be very useful for sandboxing with seccomp. mmap_process() sounds utterly gross, one process playing with another process's address space.. yuck! This is debugging. We're playing with registers, we're playing with the cpu, we're playing with memory contents. Why not the address space as well? For seccomp, this really should be generalized. Run a system call on behalf of another process, but don't let that process do anything to affect it. I think Google is doing something clever with one thread in seccomp mode and another unconstrained, but that's very hacky - you have to stop the constrained thread so it can't interfere with the live one. -- Do not meddle in the internals of kernels, for they are subtle and quick to panic.
Re: [RFC] [PATCH 1/7] User Space Breakpoint Assistance Layer (UBP)
On Sat, 2010-01-16 at 18:48 -0500, Jim Keniston wrote: As you may have noted before, I think FP would be a special problem for your approach. I'm not sure how folks would react to the idea of executing FP instructions in kernel space. But emulating them is also tough. There's an IEEE FP emulation package somewhere in one of the Linux arch directories, but I'm not sure how precise it is, and dropping even 1 bit of precision is unacceptable for many applications, since such errors tend to grow in complex computations employing many FP instructions. Well, we have kernel space using FP/MMX/SSE-like things; it's not hard if you really need it. But in this case I think it's easier than normal, because we'll just allow it to change the userspace state, since that is exactly what we want it to do.
Re: [RFC] [PATCH 1/7] User Space Breakpoint Assistance Layer (UBP)
On Sat, 2010-01-16 at 19:12 -0500, Bryan Donlan wrote: On Fri, Jan 15, 2010 at 7:58 PM, Jim Keniston jkeni...@us.ibm.com wrote: 4. Emulation removes the need for the XOL area, but requires pretty much total knowledge of the instruction set. It's also a performance win for architectures that can't do #3. I see kvm implemented on 4 architectures (ia64, powerpc, s390, x86). Coincidentally, those are the architectures to which uprobes (old uprobes, with ubp and xol bundled in) has already been ported (though Intel hasn't been maintaining their ia64 port). So it sort of comes down to how objectionable the XOL vma (or page) really is. On x86 at least, wouldn't one option to be to run the instruction to be emulated in CPL ('ring') 2, from a XOL page above the user-kernel split, not accessible to userspace at CPL 3? Linux hasn't traditionally used anything other than CPL 0 and CPL 3 (plus CPL 1 on Xen), but it would seem to avoid many of the problems here - it's invisible to normal userspace code and so doesn't pollute userspace memory maps with kernel-private stuff, but since it's running at a higher CPL than the kernel, we can still protect kernel memory and protect against privileged instructions. Another option is to go play games with the RPL of the user data segments when we load them. But yeah, something like this seems to nicely deal with the protection issues.
Re: [RFC] [PATCH 1/7] User Space Breakpoint Assistance Layer (UBP)
On Sun, 2010-01-17 at 21:33 +0200, Avi Kivity wrote: On 01/17/2010 05:03 PM, Peter Zijlstra wrote: btw, an alternative is to require the caller to provide the address space for this. If the caller is in another process, we need to allow it to play with the target's address space (i.e. mmap_process()). I don't think uprobes justifies this by itself, but mmap_process() can be very useful for sandboxing with seccomp. mmap_process() sounds utterly gross, one process playing with another process's address space.. yuck! This is debugging. We're playing with registers, we're playing with the cpu, we're playing with memory contents. Why not the address space as well? Because you want things to be as transparent as possible in order to avoid heisenbugs. Sure, we cannot avoid everything, but we should avoid everything we possibly can. Also, aside from the VDSO, we simply do not force-map things into address spaces (and as I said before, I think the VDSO stinks for doing that), and I think we don't want to create (more) precedents in this case.
Re: [RFC] [PATCH 1/7] User Space Breakpoint Assistance Layer (UBP)
On Fri, Jan 15, 2010 at 7:58 PM, Jim Keniston jkeni...@us.ibm.com wrote: 4. Emulation removes the need for the XOL area, but requires pretty much total knowledge of the instruction set. It's also a performance win for architectures that can't do #3. I see kvm implemented on 4 architectures (ia64, powerpc, s390, x86). Coincidentally, those are the architectures to which uprobes (old uprobes, with ubp and xol bundled in) has already been ported (though Intel hasn't been maintaining their ia64 port). So it sort of comes down to how objectionable the XOL vma (or page) really is. On x86 at least, wouldn't one option be to run the instruction to be emulated in CPL ('ring') 2, from a XOL page above the user-kernel split, not accessible to userspace at CPL 3? Linux hasn't traditionally used anything other than CPL 0 and CPL 3 (plus CPL 1 on Xen), but it would seem to avoid many of the problems here - it's invisible to normal userspace code and so doesn't pollute userspace memory maps with kernel-private stuff, but since it's running at a higher CPL than the kernel, we can still protect kernel memory and protect against privileged instructions.
Re: [RFC] [PATCH 1/7] User Space Breakpoint Assistance Layer (UBP)
On Thu, 2010-01-14 at 11:46 -0800, Jim Keniston wrote: discussed elsewhere. Thanks for the pointer...
Re: [RFC] [PATCH 1/7] User Space Breakpoint Assistance Layer (UBP)
On Fri, Jan 15, 2010 at 10:03:48AM +0100, Peter Zijlstra wrote: On Thu, 2010-01-14 at 11:46 -0800, Jim Keniston wrote: discussed elsewhere. Thanks for the pointer... :-) Peter, I think Jim was referring to http://sources.redhat.com/ml/systemtap/2007-q1/msg00571.html Ananth
Re: [RFC] [PATCH 1/7] User Space Breakpoint Assistance Layer (UBP)
On Fri, 2010-01-15 at 15:08 +0530, Ananth N Mavinakayanahalli wrote: On Fri, Jan 15, 2010 at 10:03:48AM +0100, Peter Zijlstra wrote: On Thu, 2010-01-14 at 11:46 -0800, Jim Keniston wrote: discussed elsewhere. Thanks for the pointer... :-) Peter, I think Jim was referring to http://sources.redhat.com/ml/systemtap/2007-q1/msg00571.html That's a 2007 email from some obscure list... that's hardly something that can be referenced without a link. As previously stated, I think poking at a process's address space is an utter no-go.
Re: [RFC] [PATCH 1/7] User Space Breakpoint Assistance Layer (UBP)
On Fri, 2010-01-15 at 15:40 +0530, Ananth N Mavinakayanahalli wrote: Ideas? emulate the one instruction?
Re: [RFC] [PATCH 1/7] User Space Breakpoint Assistance Layer (UBP)
On Fri, Jan 15, 2010 at 11:13:32AM +0100, Peter Zijlstra wrote: On Fri, 2010-01-15 at 15:40 +0530, Ananth N Mavinakayanahalli wrote: Ideas? emulate the one instruction? In kernel? Generically? Don't think it's that easy for userspace -- you have the full gamut of instructions to emulate (fp, vector, etc); further, the instruction could itself cause a page fault and the like.
Re: [RFC] [PATCH 1/7] User Space Breakpoint Assistance Layer (UBP)
On Fri, 2010-01-15 at 15:52 +0530, Ananth N Mavinakayanahalli wrote: On Fri, Jan 15, 2010 at 11:13:32AM +0100, Peter Zijlstra wrote: On Fri, 2010-01-15 at 15:40 +0530, Ananth N Mavinakayanahalli wrote: Ideas? emulate the one instruction? In kernel? Generically? Don't think it's that easy for userspace -- you have the full gamut of instructions to emulate (fp, vector, etc); further, Can't you JIT a piece of code that wraps the one instruction: save the full cpu state, set the userspace segments, have it load pt_regs (except for the IP), execute the one insn, save the results, restore the full state? Then replace pt_regs with the saved result, advance the stored IP by the length of that one instruction, and return to userspace? All you need to take care of are the priv insns, but doesn't something like kvm already have code to deal with that? the instruction could itself cause a page fault and the like. Faults aren't a problem, we take faults from kernel space all the time.
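As a rough illustration of the sequence proposed above (load the saved state except for the IP, execute the one instruction, store the results, advance the stored IP by the instruction length), here is a toy Python sketch. It is not kernel code: the register dictionary, `step_probed_instruction()` and `add_rbx_to_rax()` are hypothetical names invented for this example, and the sketch deliberately ignores instructions that read or write the IP themselves (branches, calls, IP-relative addressing), which would need extra handling.

```python
from typing import Callable, Dict

Regs = Dict[str, int]  # toy stand-in for pt_regs; field names are illustrative


def step_probed_instruction(saved: Regs, insn_len: int,
                            execute: Callable[[Regs], None]) -> Regs:
    """Run one instruction against a private copy of the saved registers."""
    scratch = dict(saved)          # load the saved user state ...
    ip = scratch.pop("ip")         # ... except for the IP
    execute(scratch)               # execute the one instruction
    scratch["ip"] = ip + insn_len  # advance the stored IP past it
    return scratch                 # caller would replace pt_regs with this


def add_rbx_to_rax(regs: Regs) -> None:
    """Hypothetical probed instruction: add %rbx, %rax (3 bytes on x86-64)."""
    regs["rax"] += regs["rbx"]


saved = {"ip": 0x400100, "rax": 5, "rbx": 7}
after = step_probed_instruction(saved, 3, add_rbx_to_rax)
print(hex(after["ip"]), after["rax"])
```

The saved state is left untouched until the caller commits the result, which is roughly what "replace pt_regs with the saved result" asks for.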
Re: [RFC] [PATCH 1/7] User Space Breakpoint Assistance Layer (UBP)
On Fri, 2010-01-15 at 10:02 +0100, Peter Zijlstra wrote: On Thu, 2010-01-14 at 11:46 -0800, Jim Keniston wrote: +Instruction copies to be single-stepped are stored in a per-process +single-step out of line (XOL) area, which is a little VM area +created by Uprobes in each probed process's address space. I think tinkering with the probed process's address space is a no-no. Have you run this by the linux mm folks? Sort of. Back in 2007 (!), we were getting ready to post uprobes (which was then essentially uprobes+xol+ubp) to LKML, pondering XOL alternatives and waiting for utrace to get pulled back into the -mm tree. (It turned out to be a long wait.) I emailed Andrew Morton, inquiring about the prospects for utrace and giving him a preview of utrace-based uprobes. He expressed openness to the idea of allocating a piece of the user address space for the XOL area, a la the vdso page. With advice and review from Dave Hansen, we implemented an XOL page, set up for every process (probed or not) along the same lines as the vdso page. About that time, Roland McGrath suggested using do_mmap_pgoff() to create a separate vma on demand. This was the seed of the current implementation. It had the advantages of being architecture-independent, affecting only probed processes, and allowing the allocation of more XOL slots. (Uprobes can make do with a fixed number of XOL slots -- allowing one probepoint to steal another's slot -- but it isn't pretty.) As I recall, Dave preferred the other idea (1 XOL page for every process, probed or not) -- mostly because he didn't like the idea of a new vma popping into existence when the process gets probed -- but was OK with us going ahead with Roland's idea. (I'm not a VM guy; pardon any imprecision in my language.) Jim I'd be inclined to NAK this straight out.
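To make the "fixed number of XOL slots, with one probepoint stealing another's slot" remark concrete, here is a small Python sketch of such a pool. The class `XolSlots`, its method names, and the least-recently-used victim choice are assumptions made for illustration; the message above does not say which policy uprobes uses. The sketch also ignores the hard part, namely making sure no thread is still single-stepping in a slot when it is stolen.

```python
from collections import OrderedDict


class XolSlots:
    """Fixed pool of XOL slots, handed out per probepoint address."""

    def __init__(self, nslots: int) -> None:
        self.free = list(range(nslots))
        # probepoint address -> slot index, least recently used first
        self.owner: "OrderedDict[int, int]" = OrderedDict()

    def slot_for(self, probepoint: int) -> int:
        if probepoint in self.owner:
            self.owner.move_to_end(probepoint)   # mark as recently used
            return self.owner[probepoint]
        if self.free:
            slot = self.free.pop(0)
        else:
            # Pool exhausted: steal the least recently used probepoint's slot.
            _victim, slot = self.owner.popitem(last=False)
        self.owner[probepoint] = slot
        return slot


pool = XolSlots(2)
first = pool.slot_for(0x1000)
second = pool.slot_for(0x2000)
third = pool.slot_for(0x3000)    # no free slot left, so 0x1000 loses its slot
print(first, second, third, 0x1000 in pool.owner)
```

A victim that loses its slot simply gets a new one (possibly by stealing again) the next time it is hit, which is why this works with a fixed pool but "isn't pretty".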
Re: [RFC] [PATCH 1/7] User Space Breakpoint Assistance Layer (UBP)
On Fri, 2010-01-15 at 13:07 -0800, Jim Keniston wrote: On Fri, 2010-01-15 at 10:02 +0100, Peter Zijlstra wrote: On Thu, 2010-01-14 at 11:46 -0800, Jim Keniston wrote: +Instruction copies to be single-stepped are stored in a per-process +single-step out of line (XOL) area, which is a little VM area +created by Uprobes in each probed process's address space. I think tinkering with the probed process's address space is a no-no. Have you run this by the linux mm folks? Sort of. Back in 2007 (!), we were getting ready to post uprobes (which was then essentially uprobes+xol+ubp) to LKML, pondering XOL alternatives and waiting for utrace to get pulled back into the -mm tree. (It turned out to be a long wait.) I emailed Andrew Morton, inquiring about the prospects for utrace and giving him a preview of utrace-based uprobes. He expressed openness to the idea of allocating a piece of the user address space for the XOL area, a la the vdso page. With advice and review from Dave Hansen, we implemented an XOL page, set up for every process (probed or not) along the same lines as the vdso page. About that time, Roland McGrath suggested using do_mmap_pgoff() to create a separate vma on demand. This was the seed of the current implementation. It had the advantages of being architecture-independent, affecting only probed processes, and allowing the allocation of more XOL slots. (Uprobes can make do with a fixed number of XOL slots -- allowing one probepoint to steal another's slot -- but it isn't pretty.) As I recall, Dave preferred the other idea (1 XOL page for every process, probed or not) -- mostly because he didn't like the idea of a new vma popping into existence when the process gets probed -- but was OK with us going ahead with Roland's idea. Well, I think it's all very gross, I would really like people to try and 'emulate' or plain execute those original instructions from kernel space.
As to the privileged instructions, I think qemu/kvm like projects should have pretty much all of that covered. Nor do I think we need utrace at all to make user space probes useful. Even stronger, I think the focus on utrace made you get some fundamentals wrong. It's not mainly about task state but, as I said, about text mappings, which is something utrace knows nothing about. That is not to say you cannot build a useful interface from uprobes and utrace, but it's not at all required or natural.
Re: [RFC] [PATCH 1/7] User Space Breakpoint Assistance Layer (UBP)
On Mon, 2010-01-11 at 17:55 +0530, Srikar Dronamraju wrote: User Space Breakpoint Assistance Layer (UBP) User space breakpointing Infrastructure provides kernel subsystems with architecture independent interface to establish breakpoints in user applications. This patch provides core implementation of ubp and also wrappers for architecture dependent methods. So if this is the basic infrastructure to set userspace breakpoints, then why not call this uprobe? UBP currently supports both single stepping inline and execution out of line strategies. Two different probepoints in the same process can have two different strategies. maybe explain wth these are?
Re: [RFC] [PATCH 1/7] User Space Breakpoint Assistance Layer (UBP)
On Thu, 2010-01-14 at 12:08 +0100, Peter Zijlstra wrote: On Mon, 2010-01-11 at 17:55 +0530, Srikar Dronamraju wrote: User Space Breakpoint Assistance Layer (UBP) User space breakpointing Infrastructure provides kernel subsystems with architecture independent interface to establish breakpoints in user applications. This patch provides core implementation of ubp and also wrappers for architecture dependent methods. So if this is the basic infrastructure to set userspace breakpoints, then why not call this uprobe? Ubp is for setting and removing breakpoints, and for supporting the two schemes (inline, out of line) for executing the probed instruction after you hit the breakpoint. Uprobes provides a higher-level API and deals with synchronization issues, process-vs-thread issues, execution of the client's (potentially buggy) probe handler, multiple probe clients, multiple probes at the same location, thread- and process-lifetime events, etc. UBP currently supports both single stepping inline and execution out of line strategies. Two different probepoints in the same process can have two different strategies. maybe explain wth these are? Here's a partial explanation from patch #6, section 1.1:
+When a CPU hits the breakpoint instruction, a trap occurs, the CPU's
+user-mode registers are saved, and a SIGTRAP signal is generated.
+Uprobes intercepts the SIGTRAP and finds the associated uprobe.
+It then executes the handler associated with the uprobe, passing the
+handler the addresses of the uprobe struct and the saved registers.
+...
+
+Next, Uprobes single-steps its copy of the probed instruction and
+resumes execution of the probed process at the instruction following
+the probepoint. (It would be simpler to single-step the actual
+instruction in place, but then Uprobes would have to temporarily
+remove the breakpoint instruction. This would create problems in a
+multithreaded application. For example, it would open a time window
+when another thread could sail right past the probepoint.)
+
+Instruction copies to be single-stepped are stored in a per-process
+single-step out of line (XOL) area, which is a little VM area
+created by Uprobes in each probed process's address space.
This (single-stepping out of line = SSOL) is essentially what kprobes does on most architectures. XOL (execution out of line) is actually a broader category that could include other schemes, discussed elsewhere. Jim
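The flow described in the patch excerpt (hit the breakpoint, run the handler, step a copy of the probed instruction out of line, resume at the next instruction) can be modelled with a toy interpreter. This is only a sketch in Python, not the uprobes implementation: the one-operand "add" instruction set, `ToyProcess`, and `insert_probe()` are invented for illustration, and a real SSOL scheme points the IP at the slot and fixes it up afterwards rather than executing the copy directly.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

Instr = Tuple[str, int]            # toy encoding: (opcode, operand)
BREAKPOINT: Instr = ("int3", 0)


@dataclass
class Regs:
    pc: int = 0
    acc: int = 0


@dataclass
class ToyProcess:
    text: List[Instr]
    xol: Dict[int, Instr] = field(default_factory=dict)   # probe addr -> copy
    handlers: Dict[int, Callable[[Regs], None]] = field(default_factory=dict)

    def insert_probe(self, addr: int, handler: Callable[[Regs], None]) -> None:
        self.xol[addr] = self.text[addr]    # keep a copy in the XOL area
        self.handlers[addr] = handler
        self.text[addr] = BREAKPOINT        # arm the breakpoint in place

    @staticmethod
    def _execute(instr: Instr, regs: Regs) -> None:
        op, arg = instr
        if op != "add":
            raise ValueError(f"unknown opcode {op!r}")
        regs.acc += arg

    def run(self, regs: Regs) -> Regs:
        while regs.pc < len(self.text):
            instr = self.text[regs.pc]
            if instr == BREAKPOINT:
                addr = regs.pc
                self.handlers[addr](regs)             # probe handler runs first
                self._execute(self.xol[addr], regs)   # step the copy, not the text
                regs.pc = addr + 1                    # resume after the probepoint
            else:
                self._execute(instr, regs)
                regs.pc += 1
        return regs


proc = ToyProcess([("add", 1), ("add", 2), ("add", 3)])
hits: List[int] = []
proc.insert_probe(1, lambda regs: hits.append(regs.acc))
thread_a = proc.run(Regs())
thread_b = proc.run(Regs())    # a second "thread" hits the same probe
print(thread_a.acc, thread_b.acc, hits)
```

Because the breakpoint is never removed from `text`, the second thread cannot "sail right past the probepoint", which is the property the in-place single-stepping scheme gives up.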