Re: [RFC] [PATCH 0/7] UBP, XOL and Uprobes [ Summary of Comments and actions to be taken ]
On Wed, 2010-01-27 at 07:53 +0100, Peter Zijlstra wrote: On Fri, 2010-01-22 at 12:54 +0530, Ananth N Mavinakayanahalli wrote: On Fri, Jan 22, 2010 at 12:32:32PM +0530, Srikar Dronamraju wrote: Here is a summary of the Comments and actions that need to be taken for the current uprobes patchset. Please let me know if I missed or misunderstood any of your comments. 1. Uprobes depends on trap signal. Uprobes depends on trap signal rather than hooking to the global die notifier. It was suggested that we hook to the global die notifier. In the next version of patches, Uprobes will use the global die notifier and look at the per-task count of the probes in use to see if it has to be consumed. However this would reduce the ability of uprobe handlers to sleep. Since we are dealing with userspace, sleeping in handlers would have been a good feature. We are looking at ways to get around this limitation. We could set a TIF_ flag in the notifier to indicate a breakpoint hit and process it in task context before the task heads into userspace. OK, so we can go play stack games in the INT3 interrupt handler by moving to a non IST stack when it comes from userspace, or move kprobes over to INT1 or something. Right, it just got pointed out that INT1 doesn't have a single byte encoding, only INT0 and INT3 :/
Re: linux-next: add utrace tree
On Wed, 2010-01-27 at 02:43 -0800, Linus Torvalds wrote: On Wed, 27 Jan 2010, Peter Zijlstra wrote: Right, so you're going to love uprobes, which does exactly that. The current proposal is overwriting the target instruction with an INT3 and injecting an extra vma into the target process's address space containing the original instruction(s) and possible jumps back to the old code stream. Just out of interest, how does it handle the threading issue? Last I saw, at least some CPU people were _very_ nervous about overwriting instructions if another CPU might be just about to execute them. Even the overwrite only the first byte with 'int3' made them go umm, I need to talk to some core CPU people to see if that's ok. They mumble about possible CPU errata, I$ coherency, instruction retry etc. I realize kprobes does this very thing, but kprobes is esoteric stuff and doesn't have much choice. In user space, you _could_ do the modification on a different physical page and then just switch the page table entry instead, and not get into the whole D$/I$ coherency thing at all. Right, so there's two aspects: 1) concurrency when inserting the probe 2) concurrency when hitting the probe 1) used to be dealt with by using utrace to stop all threads in the process and then writing the instruction. I suggested to CoW the page, modify the instruction, set the pagetable and flush tlbs at full speed -- the very thing you suggest here. 2) so traditionally (and the intel arch manual describes this) is to replace the instruction, single step it, and write the probe back. This is racy for multi-threading. The current uprobes stuff solves this by doing single-step-out-of-line (XOL). XOL injects a new vma into the target process and puts the old instruction there, then it single steps on the new location, leaving the original site with INT3. This doesn't work for things like RIP relative instructions, so uprobes considers them un-probable. Also, I myself really object to inserting a vma in a running process, its like a land-lord, sure he has the key but he won't come in an poke through your things. The alternative is to place the instruction in TLS or stack space, since each thread can only have a single trap at a time, you only need space for 1 instruction (plus a possible jump out to the original site). There is the 'problem' of marking the TLS/stack executable when being probed. Then there is the whole emulation angle, the uprobes people basically say its too much effort to write a x86 emulator.
Re: linux-next: add utrace tree
On Wed, 2010-01-27 at 11:55 +0100, Peter Zijlstra wrote: Right, so there's two aspects: 1) concurrency when inserting the probe 2) concurrency when hitting the probe 1) used to be dealt with by using utrace to stop all threads in the process and then writing the instruction. I suggested to CoW the page, modify the instruction, set the pagetable and flush tlbs at full speed -- the very thing you suggest here. Also, since executable maps are typically MAP_PRIVATE, you have to CoW anyway in order to modify it and I would exclude MAP_SHARED from being probable because then the modification could seep through into whatever was backing that thing.
Re: linux-next: add utrace tree
On Wed, 2010-01-27 at 16:35 +0530, Ananth N Mavinakayanahalli wrote: Probing RIP-relative instructions work just fine; there are fixups that take care of it. Ah my bad then, it was my understanding you simply bailed on those. Just for my information, how large are the replacement sequences?
Re: [RFC] [PATCH 0/7] UBP, XOL and Uprobes [ Summary of Comments and actions to be taken ]
On Fri, 2010-01-22 at 12:54 +0530, Ananth N Mavinakayanahalli wrote: On Fri, Jan 22, 2010 at 12:32:32PM +0530, Srikar Dronamraju wrote: Here is a summary of the Comments and actions that need to be taken for the current uprobes patchset. Please let me know if I missed or misunderstood any of your comments. 1. Uprobes depends on trap signal. Uprobes depends on trap signal rather than hooking to the global die notifier. It was suggested that we hook to the global die notifier. In the next version of patches, Uprobes will use the global die notifier and look at the per-task count of the probes in use to see if it has to be consumed. However this would reduce the ability of uprobe handlers to sleep. Since we are dealing with userspace, sleeping in handlers would have been a good feature. We are looking at ways to get around this limitation. We could set a TIF_ flag in the notifier to indicate a breakpoint hit and process it in task context before the task heads into userspace. Make that optional, not everybody might want that. Either provide a simple trampoline or use a flag to indicate the callback be called from process context on registration.
Re: linux-next: add utrace tree
On Fri, 2010-01-22 at 15:01 -0500, Frank Ch. Eigler wrote: So then there's uprobes, which is another potential utrace killer app That's bollocks, uprobes is an utter and total mis-match for utrace. Probing userspace is primarily about DSOs which is files and vma's, not tasks. You might maybe want a utrace interface to that, but that is largely non-interesting. IOW, we don't need utrace to make sensible use of uprobes. (And when I speak of uprobes I mean the thing formerly called UBP)
Re: [RFC] [PATCH 1/7] User Space Breakpoint Assistance Layer (UBP)
On Wed, 2010-01-20 at 11:43 +0200, Avi Kivity wrote: 1. Write a trace entry into shared memory, trap into the kernel on overflow. 2. Trap if a condition is satisfied (fast watchpoint implementation). So now you want to consume more of a process' address space to store trace data as well? Not to mention that that process could wreck the trace data rendering it utterly unreliable.
Re: [RFC] [PATCH 1/7] User Space Breakpoint Assistance Layer (UBP)
On Mon, 2010-01-18 at 13:01 +0200, Avi Kivity wrote: You've made it clear that you don't like it, but not why. The kernel already manages the user's address space (except for MAP_FIXED which is unreliable unless you've already reserved the address space). I don't see why adding a vma for debugging is so horrible. Well, the kernel only does what the user (and loader) tell it through mmap(). Other than that we never (except this VDSO thing) inject vmas, and I see no reason to start doing that now.
Re: [RFC] [PATCH 1/7] User Space Breakpoint Assistance Layer (UBP)
On Mon, 2010-01-18 at 13:01 +0200, Avi Kivity wrote: If we reserve some address space, you don't add any heisenbugs (at least, not any additional ones over emulation). Even if we don't, address space layout randomization means we're not keeping the address space layout constant between runs anyway. Well, it still limits the number of probes to the reserved area. If you want more you need to grow the area.. which then changes the state.
Re: [RFC] [PATCH 1/7] User Space Breakpoint Assistance Layer (UBP)
On Mon, 2010-01-18 at 14:01 +0200, Avi Kivity wrote: Maybe you place no value on uprobes. But people who debug userspace likely will see a reason. I do see value in uprobes, I just don't like it mucking about with the address space. Nor does it appear required.
Re: [RFC] [PATCH 1/7] User Space Breakpoint Assistance Layer (UBP)
On Mon, 2010-01-18 at 14:17 +0200, Avi Kivity wrote: On 01/18/2010 02:13 PM, Pekka Enberg wrote: So how big chunks of the address space are we talking here for uprobes? That's for the authors to answer, but at a guess, 32 bytes per probe (largest x86 instruction is 15 bytes), so 32 MB will give you a million probes. That's a piece of cake for x86-64, probably harder to justify for i386. Yeah, I'm aware of people turning off address space randomization to gain more virtual space on i386, I'm pretty sure those folks aren't going to be happy if we shrink it. Let alone them trying to probe their app.
Re: [RFC] [PATCH 1/7] User Space Breakpoint Assistance Layer (UBP)
On Mon, 2010-01-18 at 14:37 +0200, Avi Kivity wrote: On 01/18/2010 02:14 PM, Peter Zijlstra wrote: Well, the alternatives are very unappealing. Emulation and single-stepping are going to be very slow compared to a couple of jumps. With CPL2 or RPL on user segments the protection issue seems to be manageable for running the instructions from kernel space. CPL2 gives unrestricted access to the kernel address space; and RPL does not affect page level protection. Segment limits don't work on x86-64. But perhaps I missed something - these things are tricky. So setting RPL to 3 on the user segments allows access to kernel pages just fine? How useful.. :/ It should be possible to translate the instruction into an address space check, followed by the action, but that's still slower due to privilege level switches. Well, if you manage to do the address validation you don't need the priv level switch anymore, right? Are the ins encodings sane enough to recognize mem parameters without needing to know the actual ins? How about using a hw-breakpoint to close the gap for the inline single step? You could even re-insert the int3 lazily when you need the hw-breakpoint again. It would consume one hw-breakpoint register for each task/cpu that has probes though..
Re: [RFC] [PATCH 1/7] User Space Breakpoint Assistance Layer (UBP)
On Sun, 2010-01-17 at 16:56 +0200, Avi Kivity wrote: On 01/17/2010 04:52 PM, Peter Zijlstra wrote: Also, if its fixed size you're imposing artificial limits on the number of possible probes. Obviously we'll need a limit, a uprobe will also take kernel memory, we can't allow people to exhaust it. Only if its unprivilidged, kernel and root should be able to place as many probes until the machine keels over.
Re: [RFC] [PATCH 1/7] User Space Breakpoint Assistance Layer (UBP)
On Sun, 2010-01-17 at 16:59 +0200, Avi Kivity wrote: On 01/17/2010 04:52 PM, Peter Zijlstra wrote: On Sun, 2010-01-17 at 16:39 +0200, Avi Kivity wrote: On 01/15/2010 11:50 AM, Peter Zijlstra wrote: As previously stated, I think poking at a process's address space is an utter no-go. Why not reserve an address space range for this, somewhere near the top of memory? It doesn't have to be populated if it isn't used. Because I think poking at a process's address space like that is gross. Also, if its fixed size you're imposing artificial limits on the number of possible probes. btw, an alternative is to require the caller to provide the address space for this. If the caller is in another process, we need to allow it to play with the target's address space (i.e. mmap_process()). I don't think uprobes justifies this by itself, but mmap_process() can be very useful for sandboxing with seccomp. mmap_process() sounds utterly gross, one process playing with another process's address space.. yuck!
Re: [RFC] [PATCH 1/7] User Space Breakpoint Assistance Layer (UBP)
On Sat, 2010-01-16 at 18:48 -0500, Jim Keniston wrote: As you may have noted before, I think FP would be a special problem for your approach. I'm not sure how folks would react to the idea of executing FP instructions in kernel space. But emulating them is also tough. There's an IEEE FP emulation package somewhere in one of the Linux arch directories, but I'm not sure how precise it is, and dropping even 1 bit of precision is unacceptable for many applications, since such errors tend to grow in complex computations employing many FP instructions. Well, we have kernel space using FP/MMX/SSE like things, its not hard if you really need it, but in this case I think its easier than normal, because we'll just allow it to change the userspace state because that is exactly what we want it to do.
Re: [RFC] [PATCH 1/7] User Space Breakpoint Assistance Layer (UBP)
On Sat, 2010-01-16 at 19:12 -0500, Bryan Donlan wrote: On Fri, Jan 15, 2010 at 7:58 PM, Jim Keniston jkeni...@us.ibm.com wrote: 4. Emulation removes the need for the XOL area, but requires pretty much total knowledge of the instruction set. It's also a performance win for architectures that can't do #3. I see kvm implemented on 4 architectures (ia64, powerpc, s390, x86). Coincidentally, those are the architectures to which uprobes (old uprobes, with ubp and xol bundled in) has already been ported (though Intel hasn't been maintaining their ia64 port). So it sort of comes down to how objectionable the XOL vma (or page) really is. On x86 at least, wouldn't one option to be to run the instruction to be emulated in CPL ('ring') 2, from a XOL page above the user-kernel split, not accessible to userspace at CPL 3? Linux hasn't traditionally used anything other than CPL 0 and CPL 3 (plus CPL 1 on Xen), but it would seem to avoid many of the problems here - it's invisible to normal userspace code and so doesn't pollute userspace memory maps with kernel-private stuff, but since it's running at a higher CPL than the kernel, we can still protect kernel memory and protect against privileged instructions. Another option is to go play games with the RPL of the user data segments when we load them. But yeah, something like this seems to nicely deal with the protection issues.
Re: [RFC] [PATCH 1/7] User Space Breakpoint Assistance Layer (UBP)
On Sun, 2010-01-17 at 21:33 +0200, Avi Kivity wrote: On 01/17/2010 05:03 PM, Peter Zijlstra wrote: btw, an alternative is to require the caller to provide the address space for this. If the caller is in another process, we need to allow it to play with the target's address space (i.e. mmap_process()). I don't think uprobes justifies this by itself, but mmap_process() can be very useful for sandboxing with seccomp. mmap_process() sounds utterly gross, one process playing with another process's address space.. yuck! This is debugging. We're playing with registers, we're playing with the cpu, we're playing with memory contents. Why not the address space as well? Because you want thins go to be as transparent as possible in order to avoid heisenbugs. Sure we cannot avoid everything, but we should avoid everything we possibly can. Also, aside of the VDSO, we simply do not force map things into address spaces (and like said before, I think the VDSO stinks for doing that) and I think we don't want to create (more) precedents in this case.
Re: [RFC] [PATCH 1/7] User Space Breakpoint Assistance Layer (UBP)
On Thu, 2010-01-14 at 11:46 -0800, Jim Keniston wrote: discussed elsewhere. Thanks for the pointer...
Re: [RFC] [PATCH 4/7] Uprobes Implementation
On Thu, 2010-01-14 at 14:49 -0800, Jim Keniston wrote: On Thu, 2010-01-14 at 12:09 +0100, Peter Zijlstra wrote: On Mon, 2010-01-11 at 17:55 +0530, Srikar Dronamraju wrote: Uprobes Infrastructure enables user to dynamically establish probepoints in user applications and collect information by executing a handler functions when the probepoints are hit. Please refer Documentation/uprobes.txt for more details. This patch provides the core implementation of uprobes. This patch builds on utrace infrastructure. You need to follow this up with the uprobes patch for your architecture. So all this is basically some glue between what you call ubp (the real userspace breakpoint stuff) and utrace? Or does it do more? My reply in http://lkml.indiana.edu/hypermail/linux/kernel/1001.1/02483.html addresses this. Right, so all that need be done is add the multiple probe stuff to UBP and its a sane interface to use on its own, at which point I'd be inclined to call that uprobes (UBP really is an crap name). Then we can ditch the whole utrace muck as I see no reason to want to use that, whereas the ubp (given a sane name) looks interesting.
Re: [RFC] [PATCH 4/7] Uprobes Implementation
On Fri, 2010-01-15 at 04:26 -0500, Frank Ch. Eigler wrote: Peter Zijlstra pet...@infradead.org writes: [...] Right, so all that need be done is add the multiple probe stuff to UBP and its a sane interface to use on its own, at which point I'd be inclined to call that uprobes (UBP really is an crap name). At one point ubp+uprobes were one piece. They were separated on the suspicion that lkml would like them that way. Right, good thinking, that way we can use ubp without having to use utrace ;-) Then we can ditch the whole utrace muck as I see no reason to want to use that, whereas the ubp (given a sane name) looks interesting. Assuming you meant what you write, perhaps you misunderstand the layering relationship of these pieces. utrace underlies uprobes and other process manipulation functionality, present and future. Why, utrace doesn't at all look to bring a fundamental contribution to all that. If there's a proper kernel interface to install probes on userspace code (ubp seems to be mostly that) then I can use perf/ftrace to do the rest of the state management, no need to use utrace there. You can hardly force me to use utrace there, can you?
Re: [RFC] [PATCH 1/7] User Space Breakpoint Assistance Layer (UBP)
On Fri, 2010-01-15 at 15:08 +0530, Ananth N Mavinakayanahalli wrote: On Fri, Jan 15, 2010 at 10:03:48AM +0100, Peter Zijlstra wrote: On Thu, 2010-01-14 at 11:46 -0800, Jim Keniston wrote: discussed elsewhere. Thanks for the pointer... :-) Peter, I think Jim was referring to http://sources.redhat.com/ml/systemtap/2007-q1/msg00571.html That's a 2007 email from some obscure list... that's hardly something that can be referenced to without link. As previously stated, I think poking at a process's address space is an utter no-go.
Re: [RFC] [PATCH 1/7] User Space Breakpoint Assistance Layer (UBP)
On Fri, 2010-01-15 at 15:40 +0530, Ananth N Mavinakayanahalli wrote: Ideas? emulate the one instruction?
Re: [RFC] [PATCH 4/7] Uprobes Implementation
On Fri, 2010-01-15 at 15:56 +0530, Srikar Dronamraju wrote: Hi Peter, Or there could be two threads that could be racing to insert/delete a breakpoint. These synchronization issues are all handled by the Uprobes layer. Shouldn't be hard to put that in the ubp layer, right? Uprobes layer would need to be notified of process life-time events like fork/clone/exec/exit. No so much the process lifetimes as the vma life times are interesting, placing a hook in the vm code to track that isn't too hard, It also needs to know - when a breakpoint is hit - stop and resume a thread. A simple hook in the trap code is done quickly enough, and no reason to stop the thread, its not going anywhere when it traps.
Re: [RFC] [PATCH 1/7] User Space Breakpoint Assistance Layer (UBP)
On Fri, 2010-01-15 at 15:52 +0530, Ananth N Mavinakayanahalli wrote: On Fri, Jan 15, 2010 at 11:13:32AM +0100, Peter Zijlstra wrote: On Fri, 2010-01-15 at 15:40 +0530, Ananth N Mavinakayanahalli wrote: Ideas? emulate the one instruction? In kernel? Generically? Don't think its that easy for userspace -- you have the full gamut of instructions to emulate (fp, vector, etc); further, Can't you jit a piece of code that wraps the one instruction, save the full cpu state, set the userspace segments, have it load pt_regs (except for the IP) execute the one ins, save the results, restore the full state? Then replace pt_regs with the saved result and advance the stored IP by the length of that one instruction and return to userspace? All you need to take care of are the priv insns, but doesn't something like kvm already have code to deal with that? the instruction could itself cause a page fault and the like. Faults aren't a problem, we take faults from kernel space all the time.
Re: [RFC] [PATCH 4/7] Uprobes Implementation
On Fri, 2010-01-15 at 16:35 +0530, Maneesh Soni wrote: On Fri, Jan 15, 2010 at 11:33:27AM +0100, Peter Zijlstra wrote: On Fri, 2010-01-15 at 15:56 +0530, Srikar Dronamraju wrote: Hi Peter, Or there could be two threads that could be racing to insert/delete a breakpoint. These synchronization issues are all handled by the Uprobes layer. Shouldn't be hard to put that in the ubp layer, right? Uprobes layer would need to be notified of process life-time events like fork/clone/exec/exit. No so much the process lifetimes as the vma life times are interesting, placing a hook in the vm code to track that isn't too hard, I think similar hooks were given thumbs down in the previous incarnation of uprobes (which was implemented without utrace). http://lkml.indiana.edu/hypermail/linux/kernel/0603.2/1254.html I wasn't at all proposing to mess with a_ops, nor do you really need to, I was more thinking of adding a callback like perf_event_mmap() and a corresponding unmap(), that way you can track mapping life-times and add/remove probes accordingly. Adding the probe uses the fact that (most) executable mappings are MAP_PRIVATE and CoWs a private copy of the page with the modified ins, right? What does it do for MAP_SHARED|MAP_EXECUTABLE sections -- simply fail to add the probe?
Re: [RFC] [PATCH 4/7] Uprobes Implementation
On Fri, 2010-01-15 at 12:12 +0100, Peter Zijlstra wrote: On Fri, 2010-01-15 at 16:35 +0530, Maneesh Soni wrote: On Fri, Jan 15, 2010 at 11:33:27AM +0100, Peter Zijlstra wrote: On Fri, 2010-01-15 at 15:56 +0530, Srikar Dronamraju wrote: Hi Peter, Or there could be two threads that could be racing to insert/delete a breakpoint. These synchronization issues are all handled by the Uprobes layer. Shouldn't be hard to put that in the ubp layer, right? Uprobes layer would need to be notified of process life-time events like fork/clone/exec/exit. No so much the process lifetimes as the vma life times are interesting, placing a hook in the vm code to track that isn't too hard, I think similar hooks were given thumbs down in the previous incarnation of uprobes (which was implemented without utrace). http://lkml.indiana.edu/hypermail/linux/kernel/0603.2/1254.html I wasn't at all proposing to mess with a_ops, nor do you really need to, I was more thinking of adding a callback like perf_event_mmap() and a corresponding unmap(), that way you can track mapping life-times and add/remove probes accordingly. Adding the probe uses the fact that (most) executable mappings are MAP_PRIVATE and CoWs a private copy of the page with the modified ins, right? Does it clean up the CoW'ed page on removing the probe? Does that account for userspace having made other changes in between installing and removing the probe (for PROT_WRITE mappings obviously)?
Re: [RFC] [PATCH 4/7] Uprobes Implementation
On Fri, 2010-01-15 at 19:50 +0530, Srikar Dronamraju wrote: Srikar seemed to suggest it needed stop/resume. If process traps, We dont need to stop/resume other threads. uprobes needs threads to quiesce when inserting/deleting the breakpoint. Right, I don't think we need to at all. See the CoW thing from previous emails.
Re: [RFC] [PATCH 4/7] Uprobes Implementation
On Fri, 2010-01-15 at 09:22 -0500, Frank Ch. Eigler wrote: Hi - Well, I'm not in a position to argue line by line about the necessity or the cost of utrace low level guts, but this may represent the most practical engineering balance between functionality / modularity / undesirably intrusive modifications. How intrusive and non-modular is installing a DIE_INT3 notifier? I'm not sure about all the reasons pro/con, but it looks like installing such a systemwide hook would force every userspace breakpoint or kprobe event machine wide to pass through the hypothetical uprobes layer, whether or not applicable to a current task. Well, we'll have to pass through the global die notifier anyway, but a quick per task filter sounds like a good idea, we can do that by keeping a per-task count of the number of uprobes in use. Then the uprobe code can avoid the lookup if there are no task users and no global users. The advantage of this construct is that is easily allows for global users, whereas a utrace based one doesn't.
Re: [RFC] [PATCH 1/7] User Space Breakpoint Assistance Layer (UBP)
On Fri, 2010-01-15 at 13:07 -0800, Jim Keniston wrote: On Fri, 2010-01-15 at 10:02 +0100, Peter Zijlstra wrote: On Thu, 2010-01-14 at 11:46 -0800, Jim Keniston wrote: +Instruction copies to be single-stepped are stored in a per-process +single-step out of line (XOL) area, which is a little VM area +created by Uprobes in each probed process's address space. I think tinkering with the probed process's address space is a no-no. Have you ran this by the linux mm folks? Sort of. Back in 2007 (!), we were getting ready to post uprobes (which was then essentially uprobes+xol+upb) to LKML, pondering XOL alternatives and waiting for utrace to get pulled back into the -mm tree. (It turned out to be a long wait.) I emailed Andrew Morton, inquiring about the prospects for utrace and giving him a preview of utrace-based uprobes. He expressed openness to the idea of allocating a piece of the user address space for the XOL area, a la the vdso page. With advice and review from Dave Hansen, we implemented an XOL page, set up for every process (probed or not) along the same lines as the vdso page. About that time, Roland McGrath suggested using do_mmap_pgoff() to create a separate vma on demand. This was the seed of the current implementation. It had the advantages of being architecture-independent, affecting only probed processes, and allowing the allocation of more XOL slots. (Uprobes can make do with a fixed number of XOL slots -- allowing one probepoint to steal another's slot -- but it isn't pretty.) As I recall, Dave preferred the other idea (1 XOL page for every process, probed or not) -- mostly because he didn't like the idea of a new vma popping into existence when the process gets probed -- but was OK with us going ahead with Roland's idea. Well, I think its all very gross, I would really like people to try and 'emulate' or plain execute those original instructions from kernel space. As to the privileged instructions, I think qemu/kvm like projects should have pretty much all of that covered. Nor do I think we need utrace at all to make user space probes useful. Even stronger, I think the focus on utrace made you get some fundamentals wrong. Its not mainly about task state, but like said, its about text mappings, which is something utrace knows nothing about. That is not to say you cannot build a useful interface from uprobes and utrace, but its not at all required or natural.
Re: [RFC] [PATCH 1/7] User Space Breakpoint Assistance Layer (UBP)
On Mon, 2010-01-11 at 17:55 +0530, Srikar Dronamraju wrote: User Space Breakpoint Assistance Layer (UBP) User space breakpointing Infrastructure provides kernel subsystems with architecture independent interface to establish breakpoints in user applications. This patch provides core implementation of ubp and also wrappers for architecture dependent methods. So if this is the basic infrastructure to set userspace breakpoints, then why not call this uprobe? UBP currently supports both single stepping inline and execution out of line strategies. Two different probepoints in the same process can have two different strategies. maybe explain wth these are?
Re: [RFC] [PATCH 4/7] Uprobes Implementation
On Mon, 2010-01-11 at 17:55 +0530, Srikar Dronamraju wrote: Uprobes Infrastructure enables user to dynamically establish probepoints in user applications and collect information by executing a handler functions when the probepoints are hit. Please refer Documentation/uprobes.txt for more details. This patch provides the core implementation of uprobes. This patch builds on utrace infrastructure. You need to follow this up with the uprobes patch for your architecture. So all this is basically some glue between what you call ubp (the real userspace breakpoint stuff) and utrace? Or does it do more?
Re: [RFC] [PATCH 4/7] Uprobes Implementation
On Tue, 2010-01-12 at 13:44 +0530, Ananth N Mavinakayanahalli wrote: Well, I wonder if perf can ride on utrace's callbacks for the hook_task() for the clone/fork cases? Well it could, but using all of utrace to simply get simple callbacks from copy_process() is just daft so we're not going to do that.
Re: [RFC] [PATCH 3/7] Execution out of line (XOL)
On Mon, 2010-01-11 at 17:55 +0530, Srikar Dronamraju wrote: Execution out of line (XOL) Slot allocation mechanism for Execution Out of Line strategy in User space breakpointing Inftrastructure. (XOL) This patch provides slot allocation mechanism for execution out of line strategy for use with user space breakpoint infrastructure. This patch requires utrace support in kernel. This patch provides five functions xol_get_insn_slot(), xol_free_insn_slot(), xol_put_area(), xol_get_area() and xol_validate_vaddr(). Current slot allocation mechanism: 1. Allocate one dedicated slot per user breakpoint. 2. If the allocated vma is completely used, expand current vma. 3. If we cant expand the vma, allocate a new vma. Say what? I see the text, but non of it makes any sense at all.
Re: [RFC] [PATCH 7/7] Ftrace plugin for Uprobes
On Thu, 2010-01-14 at 12:23 +0100, Peter Zijlstra wrote: On Mon, 2010-01-11 at 17:56 +0530, Srikar Dronamraju wrote: This patch implements ftrace plugin for uprobes. Right, like others have said, trace events is a much saner interface. So the easiest way I can see that working is to register uprobes against a file (not a pid). Just to clarify, this means you can do things like: p:uprobe_event dso:symbol[+offs] Irrespective of whether there are any current user of that file.
Re: [RFC] [PATCH 7/7] Ftrace plugin for Uprobes
On Thu, 2010-01-14 at 12:35 +0100, Frederic Weisbecker wrote: On Thu, Jan 14, 2010 at 12:23:11PM +0100, Peter Zijlstra wrote: On Mon, 2010-01-11 at 17:56 +0530, Srikar Dronamraju wrote: This patch implements ftrace plugin for uprobes. Right, like others have said, trace events is a much saner interface. So the easiest way I can see that working is to register uprobes against a file (not a pid). Then on creation it uses rmap to find all current maps of that file and install the probe if there is a consumer for that map. Then for each new mmap() of that file, we also need to check if there's a consumer ready and install the probe. That looks racy. Say you first create a probe on /bin/ls: perf probe p addr_in_ls /bin/ls then something else launches /bin/ls behind you, probe is set on it then you launch: perf record -e probe: /bin/ls Then it goes recording the previous instance. Uhm, why? Only the perf /bin/ls instance has a consumer and will thus have a probe installed. (Or if you want to use ftrace you need to always have all instances probed anyway)
Re: [RFC] [PATCH 7/7] Ftrace plugin for Uprobes
On Thu, 2010-01-14 at 13:16 +0100, Mark Wielaard wrote: On Thu, 2010-01-14 at 12:29 +0100, Peter Zijlstra wrote: On Thu, 2010-01-14 at 12:23 +0100, Peter Zijlstra wrote: On Mon, 2010-01-11 at 17:56 +0530, Srikar Dronamraju wrote: This patch implements ftrace plugin for uprobes. Right, like others have said, trace events is a much saner interface. So the easiest way I can see that working is to register uprobes against a file (not a pid). Just to clarify, this means you can do things like: p:uprobe_event dso:symbol[+offs] Irrespective of whether there are any current user of that file. Yes, that is a good idea, you can then also refine that with a filter on a target pid. That is what systemtap also does, you define files (whether they are executables or shared libraries, etc) plus symbols/offsets/etc as targets and monitor when they get mapped in (either system wide, per executable or pid based). Well, the pid part is more the concern of the consumer of the trace-event. The event itself is task invariant and only cares about the particular code in question getting executed.
Re: [RFC,PATCH 14/14] utrace core
On Sun, 2009-12-13 at 16:25 -0800, Roland McGrath wrote: I'm sorry for the delay. I'm picking up with responding to the parts of your review that I did not include in my first reply. I appreciate very much the discussion you've had with Oleg about the issues that I did not address myself. I look forward to your replies to my comments in that first reply, which we have yet to see. Yeah, no worries, I'm not too quick these days myself.. Seems inconsistent on u32 vs enum utrace_resume_action. Why force enum utrace_resume_action into a u32? The first argument to almost all callbacks (all the ones made when the task is alive) is called action and it's consistent that its low bits contain an enum utrace_resume_action. The argument is u32 action in the report_signal and report_syscall_entry callbacks, where it combines an enum utrace_resume_action with an enum utrace_{signal,syscall}_action (respectively). It would indeed be more consistent to use u32 action throughout, but it seemed nicer not to have engine writers always writing utrace_resume_action(action) for all the cases where there are no other bits in there. C does implicit casts from enum to integer types, right? So always using u32 here would not impose any extra typing on the user, or am I missing something subtle here? The return value is a mirror of the u32 action argument. In many calls, it has only an enum utrace_resume_action in it. But in some it combines that with another enum utrace_*_action. There I went for consistency (and less typing) in the return type, since it doesn't make any difference to how you have to write the code in your callback function. The main reason I used u32 at all instead of unsigned int is just its shortness for less typing. I don't really mind changing these details to whatever people think is best. The other people writing code to use the utrace API may have more opinions than I do. I guess it could even be OK to make the return value and action argument always a plain enum utrace_resume_action, and use a second in/out enum utrace_{signal,syscall}_action * parameter in those particular calls. But that does put some more register pressure and loads/stores into this API. I don't mind the sharing of the argument, it just looked odd to have the u32/unsigned int/enum thing intermixed, since you care about typing length (as good a criteria as any) I'd just be lazy and make everything u32 ;-) Quite gross this.. can't we key off the tracehoook_report_clone_complete() and use a wakeup there? Yes, we intended to clean this up at some point later. But doing that is just a distraction and complication right now so we put it off. For this case, however, I suppose we could just punt for the initial version. This code exists to support the special semantics of calling utrace_attach_task() from inside the parent's report_clone() callback. That guarantees the parent that it wins any race with any third thread calling utrace_attach_task(). This guarantees it will be first attacher in the callback order, but the general subject of callback order is already something we think we will want to revisit in the future after we have more experience with lots of different engines trying to work together. It's most meaningful when using the UTRACE_ATTACH_EXCLUSIVE flag -- then you can use UTRACE_ATTACH_EXCLUSIVE|UTRACE_ATTACH_MATCH_OPS to synchronize with other threads trying to attach the same kind of engine to a task, and give special priority in that to the engine that attaches in report_clone() from tracing the parent. I broke the text so it reads easier for me, it might be me, it might not be proper English -- I'm not a native speaker -- but this is an example of what you asked for below. The thing is, your sentences are rather long, with lots of sub-parts and similar. I find I need a break after digesting a few such things. As to the content, can't you accomplish the same thing by processing such exclusive parent registration before exposing the child in the pid-hash? Right before cgroup_fork_callback() in kernel/fork.c:copy_process() seems like the ideal site. Really I came up with this special semantics originally just with ptrace in mind. ptrace is an engine that wants to exclude other tracer threads attaching asynchronously with the same kind of engine, and that wants to give special priority on a child to the parent's tracer (to implement PTRACE_O_TRACECLONE et al). However, Oleg's current ptrace code still relies on the old hard-coded locking kludges to exclude the separate attachers and to magically privilege the auto-attach from the parent's tracer. So we are not relying on this special semantics yet. We could just punt utrace_attach_delay() and remove the associated documentation about the special rights of report_clone() calling utrace_attach_task(). If we decide it helps clean things up later when we finish more
Re: [RFC,PATCH 14/14] utrace core
On Tue, 2009-12-08 at 16:04 +0100, Oleg Nesterov wrote: The problem is, this code was developed out-of-tree. That is why we would like to merge it asap, then do other changes which could be easily reviewed. Now, do you really mean we should throw out the working code, rewrite it avoiding these barriers, and resubmit? Sure, everything is possible. But this means another round of out-of-tree development with unclear results. Out-of-tree development is bad, it having taken lot of effort is no excuse for merging ugly. Now, I'm not against barriers at all, but code that is as barrier heavy as this, with such bad comments and no clear indication it was actually worth using so many barriers make me wonder. Barriers aren't free either, and having multiple such things in quick succession isn't nessecarily faster than a lock, but much less obvious.
Re: [RFC,PATCH 14/14] utrace core
On Tue, 2009-12-08 at 17:31 +0100, Oleg Nesterov wrote: If you take a task ref you can write the much saner: utrace_control() { ... spin_lock(utrace-lock); ... if (reset) utrace_reset(utrace); spin_unlock(utrace-lock); } No, get_task_struct() in utrace_reset() can't help, we should move it into utrace_control() then. And in this case it becomes even more subtle: it is needed because -utrace_flags may be cleared inside utrace_reset() and after that utrace_control()-spin_unlock() becomes unsafe. The task-utrace pointer is cleaned up on free_task()-tracehook_free_task()-utrace_free_task(), so by holding a ref on the task, we ensure -utrace stays around, and we can do spin_unlock(), right? Also. utrace_reset() drops utrace-lock to call put_detached_list() lockless. If we want to avoid the assymetric locking, every caller should pass struct list_head *detached to utrace_reset(), drop utrace-lock, and call put_detached_list(). All that seems to do is call -release() and kmem_cache_free()s the utrace_engine thing, why can't that be done with utrace-lock held? But yeah, passing that list along does seem like a better solution.
Re: [RFC,PATCH 14/14] utrace core
On Tue, 2009-12-01 at 23:08 +0100, Oleg Nesterov wrote: @@ -560,6 +625,20 @@ static inline void tracehook_report_deat int signal, void *death_cookie, int group_dead) { + /* + * This barrier ensures that our caller's setting of + * @task-exit_state precedes checking @task-utrace_flags here. + * If utrace_set_events() was just called to enable + * UTRACE_EVENT(DEATH), then we are obliged to call + * utrace_report_death() and not miss it. utrace_set_events() + * uses tasklist_lock to synchronize enabling the bit with the + * actual change to @task-exit_state, but we need this barrier + * to be sure we see a flags change made just before our caller + * took the tasklist_lock. + */ + smp_mb(); + if (task_utrace_flags(task) _UTRACE_DEATH_EVENTS) + utrace_report_death(task, death_cookie, group_dead, signal); } I don't think its allowed to pair a mb with a lock-barrier, since the lock barriers are semi-permeable. Could you clarify? According to memory-barriers.txt a mb can be paired with mb,wmb,rmb depending on the situation. For LOCK it states: (1) LOCK operation implication: Memory operations issued after the LOCK will be completed after the LOCK operation has completed. Memory operations issued before the LOCK may be completed after the LOCK operation has completed. Which is not something I can see making sense to pair with an mb. So either the comment is confusing and its again referring to an UNLOCK +LOCK pair, or there's something fishy @@ -589,10 +668,20 @@ static inline void set_notify_resume(str * asynchronously, this will be called again before we return to * user mode. * - * Called without locks. + * Called without locks. However, on some machines this may be + * called with interrupts disabled. */ static inline void tracehook_notify_resume(struct pt_regs *regs) { + struct task_struct *task = current; + /* + * This pairs with the barrier implicit in set_notify_resume(). + * It ensures that we read the nonzero utrace_flags set before + * set_notify_resume() was called by utrace setup. + */ + smp_rmb(); + if (task_utrace_flags(task)) + utrace_resume(task, regs); } Sending an IPI implies the mb? Yes, but this has nothing to do with IPI. The caller, do_notify_resume(), does: clear_thread_flag(TIF_NOTIFY_RESUME); tracehook_notify_resume: if (task_utrace_flags(task)) utrace_resume(); We should not read task_utrace_flags() before we clear TIF_NOTIFY_RESUME. Then this comment needs an update.. that wasn't at all clear to me. +static inline struct utrace *task_utrace_struct(struct task_struct *task) +{ + struct utrace *utrace; + + /* + * This barrier ensures that any prior load of task-utrace_flags + * is ordered before this load of task-utrace. We use those + * utrace_flags checks in the hot path to decide to call into + * the utrace code. The first attach installs task-utrace before + * setting task-utrace_flags nonzero, with a barrier between. + * See utrace_task_alloc(). + */ + smp_rmb(); + utrace = task-utrace; + + smp_read_barrier_depends(); /* See utrace_task_alloc(). */ + return utrace; +} I spot two barriers here, but only 1 over in utrace_task_alloc(), hmm? smp_read_barrier_depends() pairs with utrace_task_alloc()-wmb(). smp_rmb() is needed for another reason. Suppose the code does: if (task_utrace_flags() SOMETHING) do_something_with(task-utrace); if we race with utrace_attach_task(), we can see -utrace_flags != 0 but task-utrace == NULL without rmb(). Still not clear what we pair with.. There is no obvious barrier in utrace_attach_task() nor a comment referring to this site. +struct utrace_engine { +/* private: */ + struct kref kref; + void (*release)(void *); + struct list_head entry; + +/* public: */ + const struct utrace_engine_ops *ops; + void *data; + + unsigned long flags; +}; Sorry, the kernel is written in C, not C++. Hmm. I almost never read the comments, but these 2 look very clear to me ;) I see no point in adding comments like that at all. + * Most callbacks take an @action argument, giving the resume action + * chosen by other tracing engines. All callbacks take an @engine + * argument, and a @task argument, which is always equal to @current. Given that some functions have a lot of arguments (depleting regparam), isn't it more expensive to push current on the stack than it is to simply read it again? Yes, perhaps. Only -report_reap() really needs @task, it may be !current. +struct utrace_engine_ops { + u32 (*report_signal)(u32 action, + struct
Re: [RFC,PATCH 14/14] utrace core
On Tue, 2009-11-24 at 21:02 +0100, Oleg Nesterov wrote: + sect2 id=reaptitleFinal callbacks/title + para +The functionreport_reap/function callback is always the final event +in the life cycle of a traced thread. Tracing engines can use this as +the trigger to clean up their own data structures. The +functionreport_death/function callback is always the penultimate +event a tracing engine might see; it's seen unless the thread was +already in the midst of dying when the engine attached. Many tracing +engines will have no interest in when a parent reaps a dead process, +and nothing they want to do with a zombie thread once it dies; for +them, the functionreport_death/function callback is the natural +place to clean up data structures and detach. To facilitate writing +such engines robustly, given the asynchrony of +constantSIGKILL/constant, and without error-prone manual +implementation of synchronization schemes, the +applicationutrace/application infrastructure provides some special +guarantees about the functionreport_death/function and +functionreport_reap/function callbacks. It still takes some care +to be sure your tracing engine is robust to tear-down races, but these +rules make it reasonably straightforward and concise to handle a lot of +corner cases correctly. + /para + /sect2 This above text seems inconsistent. First you say report_reap() is the final callback and should be used for cleanup, then you say report_death() is the penultimate callback but might not always happen and people would normally use that as cleanup. If we cannot rely on report_death() then I would suggest to simply recommend report_reap() as cleanup callback and leave it at that. + para +There is nothing a kernel module can do to keep a structnamestruct +task_struct/structname alive outside of +functionrcu_read_lock/function. Sure there is, get_task_struct() comes to mind. When the task dies and is reaped +by its parent (or itself), that structure can be freed so that any +dangling pointers you have stored become invalid. +applicationutrace/application will not prevent this, but it can +help you detect it safely. By definition, a task that has been reaped +has had all its engines detached. All +applicationutrace/application calls can be safely called on a +detached engine if the caller holds a reference on that engine pointer, +even if the task pointer passed in the call is invalid. All calls +return constant-ESRCH/constant for a detached engine, which tells +you that the task pointer you passed could be invalid now. Since +functionutrace_control/function and +functionutrace_set_events/function do not block, you can call those +inside a functionrcu_read_lock/function section and be sure after +they don't return constant-ESRCH/constant that the task pointer is +still valid until functionrcu_read_unlock/function. The +infrastructure never holds task references of its own. And here you imply its existence. Though neither +functionrcu_read_lock/function nor any other lock is held while +making a callback, it's always guaranteed that the structnamestruct +task_struct/structname and the structnamestruct +utrace_engine/structname passed as arguments remain valid +until the callback function returns. + /para + + para +The common means for safely holding task pointers that is available to +kernel modules is to use structnamestruct pid/structname, which +permits functionput_pid/function from kernel modules. When using +that, the calls functionutrace_attach_pid/function, +functionutrace_control_pid/function, +functionutrace_set_events_pid/function, and +functionutrace_barrier_pid/function are available. + /para The above seems weird to me at best... why hold a pid around when you can keep a ref on the task struct? A pid might get reused leaving you with a completely different task than you thought you had. + /sect2 + + sect2 id=reap-after-death +title + Serialization of constantDEATH/constant and constantREAP/constant +/title +para + The second guarantee is the serialization of + constantDEATH/constant and constantREAP/constant event + callbacks for a given thread. The actual reaping by the parent + (functionrelease_task/function call) can occur simultaneously + while the thread is still doing the final steps of dying, including + the functionreport_death/function callback. If a tracing engine + has requested both constantDEATH/constant and + constantREAP/constant event reports, it's guaranteed that the + functionreport_reap/function callback will not be made until
Re: [RFC] [PATCH] In-kernel gdbstub based on utrace Infrastructure.
On Mon, 2009-11-30 at 17:33 +0530, Srikar Dronamraju wrote: This patch implements an in-kernel gdb stub. It provides an interface between gdb and Linux Kernel by implementing the remote serial protocol. This gdbstub uses utrace infrastructure. This patch provides register set access, signal mapping, process event handling, input/output operations. /proc/pid/gdb was chosen as file for gdb to interact with the process through remote serial protocol. Hence users would have to use target remote /proc/pid/gdb command on gdb prompt to start using this infrastructure. For Breakpointing support, gdbstub needs User space breakpointing layer and uprobes layer which will be posted later. How does this compare to kgdb and related efforts?
Re: [RFC] [PATCH] In-kernel gdbstub based on utrace Infrastructure.
On Mon, 2009-11-30 at 18:02 +0530, Srikar Dronamraju wrote: * Peter Zijlstra pet...@infradead.org [2009-11-30 13:09:12]: On Mon, 2009-11-30 at 17:33 +0530, Srikar Dronamraju wrote: This patch implements an in-kernel gdb stub. It provides an interface between gdb and Linux Kernel by implementing the remote serial protocol. This gdbstub uses utrace infrastructure. This patch provides register set access, signal mapping, process event handling, input/output operations. /proc/pid/gdb was chosen as file for gdb to interact with the process through remote serial protocol. Hence users would have to use target remote /proc/pid/gdb command on gdb prompt to start using this infrastructure. For Breakpointing support, gdbstub needs User space breakpointing layer and uprobes layer which will be posted later. How does this compare to kgdb and related efforts? This is a In-kernel gdbstub to debug user space programs. This stub doesnt help in debugging kernel. Hence I am not sure how to compare kgdb gdbstub with this gdbstub. Can you please provide more pointers on what you were referring to? Well, not even that much was clear from your changelog, so I wasn't really sure wtf I was looking at. All it says was an in-kernel gdb stub, what other than to debug the kernel would you need in-kernel stubs for? So now my question is, what do you need an in-kernel stub to debug userspace for? In general, tell me about this patch thing, what does it do, why does it do it, and how does it improve on the current situation. Your changelog doesn't address any of those things, so wth are we supposed to think?
Re: [RFC] [PATCH] In-kernel gdbstub based on utrace Infrastructure.
On Mon, 2009-11-30 at 18:49 +0530, Srikar Dronamraju wrote: This is suppose to be one of the interfaces to use utrace, i.e Allow gdb to use utrace features without having to change gdb itself. Though there are not enough features in this patch, intentions include support multi-threaded debugging, concurrent debugger support for same process, syscall tracing. For Breakpoint support(not yet submitted to LKML), it would use execution out of line rather than the conventional inline-single stepping. I guess Christoph, Roland and Frank would be able to explain in a better fashion the rational and advantages of this stub over convential gdb. Hmm,. wouldn't it make much more sense to extend the current kgdb stub to include userspace debugging, providing an all-in-one solution? I think it would be much more powerful to be able to observe the full software stack and not be limited by this user-kernel barrier. (Provided the user has sufficient privileges of course).
Re: [RFC,PATCH 0/14] utrace/ptrace
On Thu, 2009-11-26 at 12:37 +0530, Srikar Dronamraju wrote: Hi Christoph, The other thing is that this patchset really doesn't quite justify utrace. It's growing a lot more code without actually growing any useful functionality. What about all those other utrace killer features that have been promised for a long time? We are working on in-kernel gdbstub which was one of the features that you had asked for. gdbstub does pass unit tests; but we are looking at some way to hack the GDB testsuite to run its regression tests. Once we are able to run the GDB testsuite and utrace is part of some upstream tree, we plan to post these patches to LKML for comments. gdbstub uses utrace and uprobes underneath. Uprobes was rewritten to remove issues that LKML developers had opposed. Uprobes also has its own ftrace plugin to use uprobes. Currently in-kernel gdbstub is hosted by Frank Ch. Eigler over here: git://web.elastic.org/~fche/utrace-ext.git branch name utrace-gdbstub-uprobes If its anywhere near functioning it would have made sense to send it out as an RFC patch-set right along with the utrace one.