Re: [RFC] [PATCH 0/7] UBP, XOL and Uprobes [ Summary of Comments and actions to be taken ]
On Wed, 2010-01-27 at 07:53 +0100, Peter Zijlstra wrote: On Fri, 2010-01-22 at 12:54 +0530, Ananth N Mavinakayanahalli wrote: On Fri, Jan 22, 2010 at 12:32:32PM +0530, Srikar Dronamraju wrote: Here is a summary of the Comments and actions that need to be taken for the current uprobes patchset. Please let me know if I missed or misunderstood any of your comments. 1. Uprobes depends on trap signal. Uprobes depends on trap signal rather than hooking to the global die notifier. It was suggested that we hook to the global die notifier. In the next version of patches, Uprobes will use the global die notifier and look at the per-task count of the probes in use to see if it has to be consumed. However this would reduce the ability of uprobe handlers to sleep. Since we are dealing with userspace, sleeping in handlers would have been a good feature. We are looking at ways to get around this limitation. We could set a TIF_ flag in the notifier to indicate a breakpoint hit and process it in task context before the task heads into userspace. OK, so we can go play stack games in the INT3 interrupt handler by moving to a non IST stack when it comes from userspace, or move kprobes over to INT1 or something. Right, it just got pointed out that INT1 doesn't have a single byte encoding, only INTO and INT3 :/
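The deferral idea quoted above (set a TIF_ flag in the notifier, then handle the hit in task context before the task returns to userspace) can be modeled in a few lines. This is only a Python sketch of the control flow, not kernel code: `TIF_BREAKPOINT_HIT`, `Task`, `die_notifier` and `return_to_user` are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional

TIF_BREAKPOINT_HIT = 1 << 0  # hypothetical flag bit; the name is an assumption


@dataclass
class Task:
    probes_in_use: int = 0               # per-task count of active probes
    flags: int = 0                       # thread-info style flag word
    pending_addr: Optional[int] = None   # address of the breakpoint that was hit
    log: List[int] = field(default_factory=list)


def die_notifier(task: Task, trap_addr: int) -> bool:
    """Models the atomic notifier: it must not sleep, so it only records the hit.

    Returns True if the trap is consumed on behalf of the probe layer."""
    if task.probes_in_use == 0:
        return False                     # not ours; leave it for other notifiers
    task.pending_addr = trap_addr
    task.flags |= TIF_BREAKPOINT_HIT
    return True


def return_to_user(task: Task, handler: Callable[[Task, int], None]) -> None:
    """Models the exit-to-userspace path: task context, so the handler may sleep."""
    if task.flags & TIF_BREAKPOINT_HIT:
        task.flags &= ~TIF_BREAKPOINT_HIT
        handler(task, task.pending_addr)
        task.pending_addr = None
```

The point of the split is that the notifier does the minimum possible in atomic context, while anything that might block runs later on the task's own behalf.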
Re: [RFC] [PATCH 1/7] User Space Breakpoint Assistance Layer (UBP)
On 01/27/2010 10:24 AM, Ingo Molnar wrote: Not to mention that that process could wreck the trace data rendering it utterly unreliable. It could, but it also might not. Are we going to deny high performance tracing to users just because it doesn't work in all cases? Tracing and monitoring is foremost about being able to trust the instrument, then about performance and usability. That's one of the big things about ftrace and perf. By proposing 'user space tracing' you are missing two big aspects: - That self-contained, kernel-driven tracing can be replicated in user-space. It cannot. Sharing and global state is much harder to maintain reliably, but the bigger problem is that user-space can stomp on its own tracing state and can make it unreliable. Tracing is often used to figure out bugs, and tracers will be trusted less if they can stomp on themselves. - That somehow it's much faster and that this edge matters. It isnt and it doesnt matter. The few places that need very very fast tracing wont use any of these facilities - it will use something specialized. So you are creating a solution for special cases that dont need it, and you are also ignoring prime qualities of a good tracing framework. I see it exactly the opposite. Only a very small minority of cases will have such severe memory corruption that tracing will fall apart because of random writes to memory; especially on 64-bit where the address space is sparse. On the other hand, knowing that the cost is a few dozen cycles rather than a thousand or so means that you can trace production servers running full loads without worrying about whether tracing will affect whatever it is you're trying to observe. I'm not against slow reliable tracing, but we shouldn't ignore the need for speed. -- Do not meddle in the internals of kernels, for they are subtle and quick to panic.
Re: linux-next: add utrace tree
* Peter Zijlstra pet...@infradead.org wrote: On Tue, 2010-01-26 at 15:37 -0800, Linus Torvalds wrote: On Tue, 26 Jan 2010, Tom Tromey wrote: In non-stop mode (where you can stop one thread but leave the others running), gdb wants to have the breakpoints always inserted. So, something must emulate the displaced instruction. I'm almost totally uninterested in breakpoints that actually re-write instructions. It's impossible to do that efficiently and well, especially in threaded environments. So if you do instruction rewriting, I can only say that's your problem. Right, so you're going to love uprobes, which does exactly that. The current proposal is overwriting the target instruction with an INT3 and injecting an extra vma into the target process's address space containing the original instruction(s) and possible jumps back to the old code stream. I'm all in favor of not doing that extra vma and instead use stack or TLS space, but then people complain about having to make that executable (which is something I don't really mind, x86 had executable everything for very long, and also, its only so when debugging the thing anyway). I think the best solution for user probes (by far) is to use a simplified in-kernel instruction emulator for the few common probes instruction. (Kprobes already partially decodes x86 instructions to make it safe to apply accelerated probes and there's other decoding logic in the kernel too.) The design and practical advantages are numerous: - People want to probe their function prologues most of the time ... a single INT3 there will in most cases just hit the initial stack allocation and that's it. We could get quite good coverage (and very fast emulation) for the common case in not too much code - and much of that code we already have available. No re-trapping, no extra instruction patching and complex maintenance of trampolines. 
- It's as transparent as it gets - no user-space trampoline or other visible state that modifies behavior or can be stomped upon by user-space bugs.
- Lightweight and simple probe insertion: no weird setup sequence needing the stopping of all tasks to install the trampoline. We just add the INT3 and off you go.
- Emulation is evidently thread-safe, SMP-safe, etc. as it only acts on task local state.
- The points we can probe are never truly limited as it's all freely upscalable: if you cannot probe an instruction you want to probe today, extend the emulator. Deny the rest. _All_ versions of uprobes code i've seen so far already restrict the probe-compatible instruction set: RIP-relative instructions are excluded on 64-bit for example.
- Emulation has the _least_ semantic side effects as we really execute 'that' instruction - not some other instruction put elsewhere into a special vma or into the process/thread stack, or some special in-kernel trampoline, etc.
- Emulation can be very fast for the common case as well. Nobody will probe weird, complex instructions. They will use 'perf probe' to insert probes into their functions 90% of the time ...
- FPU and complex ops and pagefault emulation are not really what i'd expect to be necessary for simple probing - but they _can_ be added by people who care about them, if they so wish.
Such a scheme would be _far_ more preferable from a maintenance POV as well, as the initial code will be small, and we can extend it gradually. All the other proposals are complex 'all or nothing' schemes with no flexibility for complexity at all. Thanks, Ingo
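To make the "emulate the common prologue instructions, deny the rest" proposal concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the in-kernel design being argued for: it handles only `push %rbp` (opcode 0x55) and `mov %rsp,%rbp` (48 89 e5), models the stack as a byte buffer, and treats a stack-growth page fault as unsupported. The names `emulate_one` and `Unsupported` are hypothetical.

```python
import struct


class Unsupported(Exception):
    """Raised for anything the sketch does not emulate ("deny the rest")."""


def emulate_one(insn: bytes, regs: dict, stack: bytearray, stack_base: int) -> int:
    """Emulate one x86-64 instruction against task-local state only.

    `regs` holds 'rsp', 'rbp' and 'rip'; `stack` models the memory that starts
    at address `stack_base`. Returns the length of the emulated instruction."""
    if insn[:1] == b"\x55":                       # push %rbp
        new_rsp = regs["rsp"] - 8
        off = new_rsp - stack_base
        if off < 0:
            # The real case would fault and extend the stack vma.
            raise Unsupported("stack would need to grow")
        stack[off:off + 8] = struct.pack("<Q", regs["rbp"])
        regs["rsp"] = new_rsp
        length = 1
    elif insn[:3] == b"\x48\x89\xe5":             # mov %rsp,%rbp
        regs["rbp"] = regs["rsp"]
        length = 3
    else:
        raise Unsupported(insn.hex())
    regs["rip"] += length                         # resume after the probed insn
    return length
```

Because the function touches only the register dictionary and the stack buffer passed to it, there is no shared trampoline state to protect, which is the thread-safety argument made in the list above.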
Re: [RFC] [PATCH 1/7] User Space Breakpoint Assistance Layer (UBP)
* Avi Kivity a...@redhat.com wrote: On 01/27/2010 10:24 AM, Ingo Molnar wrote: Not to mention that that process could wreck the trace data rendering it utterly unreliable. It could, but it also might not. Are we going to deny high performance tracing to users just because it doesn't work in all cases? Tracing and monitoring is foremost about being able to trust the instrument, then about performance and usability. That's one of the big things about ftrace and perf. By proposing 'user space tracing' you are missing two big aspects: - That self-contained, kernel-driven tracing can be replicated in user-space. It cannot. Sharing and global state is much harder to maintain reliably, but the bigger problem is that user-space can stomp on its own tracing state and can make it unreliable. Tracing is often used to figure out bugs, and tracers will be trusted less if they can stomp on themselves. - That somehow it's much faster and that this edge matters. It isnt and it doesnt matter. The few places that need very very fast tracing wont use any of these facilities - it will use something specialized. So you are creating a solution for special cases that dont need it, and you are also ignoring prime qualities of a good tracing framework. I see it exactly the opposite. Only a very small minority of cases will have such severe memory corruption that tracing will fall apart because of random writes to memory; especially on 64-bit where the address space is sparse. On the other hand, knowing that the cost is a few dozen cycles rather than a thousand or so means that you can trace production servers running full loads without worrying about whether tracing will affect whatever it is you're trying to observe. I'm not against slow reliable tracing, but we shouldn't ignore the need for speed. 
I haven't seen a concise summary of your points in this thread, so let me summarize it as i've understood them (hopefully not putting words into your mouth): AFAICS you are arguing for some crazy fragile architecture-specific solution that traps INT3 into ring3 just to shave off a few cycles, and then use user-space state to trace into. If so then you ignore the obvious solution to _that_ problem: don't use INT3 at all, but rebuild (or re-JIT) your program with explicit callbacks. It's _MUCH_ faster than _any_ breakpoint based solution - literally just the cost of a function call (or not even that - i've written very fast inlined tracers - they do rock when it comes to performance). Problem solved and none of the INT3 details matters at all. INT3 only matters to _transparent_ probing, and for that, the cost of INT3 is almost _by definition_ less important than the fact that we can do transparent tracing. If performance were the overriding issue they'd use dedicated callbacks - and the INT3 technique wouldn't matter at all. ( Also, just like we were able to extend the kprobes code with more and more optimizations, the same can be done with any user-space probing as well, to make it faster. But at the core of it has to be a sane design that is transparent and controlled by the kernel, so that it has the option to apply more and more optimizations - yours isn't such and its limitations are designed-in. Which is neither smart nor useful. ) Ingo
Re: [RFC] [PATCH 1/7] User Space Breakpoint Assistance Layer (UBP)
* Avi Kivity a...@redhat.com wrote: If so then you ignore the obvious solution to _that_ problem: dont use INT3 at all, but rebuild (or re-JIT) your program with explicit callbacks. It's _MUCH_ faster than _any_ breakpoint based solution - literally just the cost of a function call (or not even that - i've written very fast inlined tracers - they do rock when it comes to performance). Problem solved and none of the INT3 details matters at all. However did I not think of that? Yes, and let's rip off kprobes tracing from the kernel, we can always rebuild it. Well, I'm observing an issue in a production system now. I may not want to take it down, or if I take it down I may not be able to observe it again as the problem takes a couple of days to show up, or I may not have the full source, or it takes 10 minutes to build and so an iterative edit/build/run cycle can stretch for hours. You have somewhat misconstrued my argument. What i said above is that _if_ you need extreme levels of performance you always have the option to go even faster via specialized tracing solutions. I did not promote it as a replacement solution. Specialization obviously brings in a new set of problems: inflexibility and non-transparency, an example of which you gave above. Your proposed solution brings in precisely such kinds of issues, on a different level, just to improve performance at the cost of transparency and at the cost of features and robustness. It's btw rather ironic as your arguments are somewhat similar to the Xen vs. KVM argument just turned around: KVM started out slower by relying on hardware implementation for virtualization while Xen relied on a clever but limiting hack. With each CPU generation the hardware got faster, while the various design limitations of Xen are hurting it and KVM is winning that race. 
A (partially) similar situation exists here: INT3 into ring 0 and handling it there in a protected environment might be more expensive, but _if_ it matters to performance it sure could be made faster in hardware (and in fact it will become faster with every new generation of hardware). Both Peter and I are telling you that we consider your solution too specialized, at the cost of flexibility, features and robustness. Thanks, Ingo
Re: linux-next: add utrace tree
On Wed, 27 Jan 2010, Peter Zijlstra wrote: Right, so you're going to love uprobes, which does exactly that. The current proposal is overwriting the target instruction with an INT3 and injecting an extra vma into the target process's address space containing the original instruction(s) and possible jumps back to the old code stream. Just out of interest, how does it handle the threading issue? Last I saw, at least some CPU people were _very_ nervous about overwriting instructions if another CPU might be just about to execute them. Even the "overwrite only the first byte with 'int3'" made them go "umm, I need to talk to some core CPU people to see if that's ok". They mumble about possible CPU errata, I$ coherency, instruction retry etc. I realize kprobes does this very thing, but kprobes is esoteric stuff and doesn't have much choice. In user space, you _could_ do the modification on a different physical page and then just switch the page table entry instead, and not get into the whole D$/I$ coherency thing at all. Linus
Re: linux-next: add utrace tree
On Wed, 2010-01-27 at 02:43 -0800, Linus Torvalds wrote: On Wed, 27 Jan 2010, Peter Zijlstra wrote: Right, so you're going to love uprobes, which does exactly that. The current proposal is overwriting the target instruction with an INT3 and injecting an extra vma into the target process's address space containing the original instruction(s) and possible jumps back to the old code stream. Just out of interest, how does it handle the threading issue? Last I saw, at least some CPU people were _very_ nervous about overwriting instructions if another CPU might be just about to execute them. Even the overwrite only the first byte with 'int3' made them go umm, I need to talk to some core CPU people to see if that's ok. They mumble about possible CPU errata, I$ coherency, instruction retry etc. I realize kprobes does this very thing, but kprobes is esoteric stuff and doesn't have much choice. In user space, you _could_ do the modification on a different physical page and then just switch the page table entry instead, and not get into the whole D$/I$ coherency thing at all. Right, so there's two aspects: 1) concurrency when inserting the probe 2) concurrency when hitting the probe 1) used to be dealt with by using utrace to stop all threads in the process and then writing the instruction. I suggested to CoW the page, modify the instruction, set the pagetable and flush tlbs at full speed -- the very thing you suggest here. 2) so traditionally (and the intel arch manual describes this) is to replace the instruction, single step it, and write the probe back. This is racy for multi-threading. The current uprobes stuff solves this by doing single-step-out-of-line (XOL). XOL injects a new vma into the target process and puts the old instruction there, then it single steps on the new location, leaving the original site with INT3. This doesn't work for things like RIP relative instructions, so uprobes considers them un-probable. 
Also, I myself really object to inserting a vma in a running process, it's like a landlord, sure he has the key but he won't come in and poke through your things. The alternative is to place the instruction in TLS or stack space, since each thread can only have a single trap at a time, you only need space for 1 instruction (plus a possible jump out to the original site). There is the 'problem' of marking the TLS/stack executable when being probed. Then there is the whole emulation angle, the uprobes people basically say it's too much effort to write an x86 emulator.
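The single-step-out-of-line flow described in this message (copy the original instruction into a slot in an extra area, leave INT3 at the probe site, step the copy, then resume after the original) can be modeled with plain byte buffers. This is a hedged Python sketch, not the posted uprobes code: `XolArea`, `install_probe`, `pre_step`, `post_step` and the 16-byte slot size are assumptions made for illustration, and the address fixup shown is only valid for instructions that do not change control flow.

```python
INT3 = 0xCC


class XolArea:
    """Models the extra vma: fixed-size slots holding copies of probed insns."""
    SLOT_SIZE = 16  # assumption: large enough for any x86 insn (max 15 bytes)

    def __init__(self, base: int, nslots: int):
        self.base = base
        self.mem = bytearray(nslots * self.SLOT_SIZE)
        self.next_free = 0

    def alloc(self, insn: bytes) -> int:
        off = self.next_free * self.SLOT_SIZE
        if off >= len(self.mem):
            raise MemoryError("XOL area full")
        self.mem[off:off + len(insn)] = insn
        self.next_free += 1
        return self.base + off


def install_probe(text: bytearray, text_base: int, addr: int,
                  insn_len: int, xol: XolArea) -> dict:
    """Copy the original instruction out of line, then plant INT3 at the site."""
    off = addr - text_base
    slot = xol.alloc(bytes(text[off:off + insn_len]))
    text[off] = INT3            # only the first byte changes, and it stays put
    return {"addr": addr, "len": insn_len, "slot": slot}


def pre_step(probe: dict) -> int:
    """Instruction pointer to single-step from after the INT3 is hit."""
    return probe["slot"]


def post_step(probe: dict, ip_after_step: int) -> int:
    """Translate the post-step ip in the slot back to the original code stream."""
    return probe["addr"] + (ip_after_step - probe["slot"])
```

Since the INT3 is never removed from the original site, other threads running through the same code cannot miss the probe while one thread is stepping its copy.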
Re: linux-next: add utrace tree
On Wed, 2010-01-27 at 11:55 +0100, Peter Zijlstra wrote: Right, so there's two aspects: 1) concurrency when inserting the probe 2) concurrency when hitting the probe 1) used to be dealt with by using utrace to stop all threads in the process and then writing the instruction. I suggested to CoW the page, modify the instruction, set the pagetable and flush tlbs at full speed -- the very thing you suggest here. Also, since executable maps are typically MAP_PRIVATE, you have to CoW anyway in order to modify it and I would exclude MAP_SHARED from being probable because then the modification could seep through into whatever was backing that thing.
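The insertion scheme suggested here (copy the page, patch the copy, switch the page table entry, flush the TLB, and refuse MAP_SHARED mappings) can be sketched as a toy model. This is Python standing in for page-table manipulation, so everything in it is an assumption for illustration: `AddressSpace`, `insert_breakpoint`, the dictionary used as a page table and the dictionary used as a TLB are hypothetical.

```python
PAGE_SIZE = 4096
INT3 = 0xCC


class AddressSpace:
    def __init__(self):
        self.page_table = {}    # vpn -> (page contents, is_shared)
        self.tlb = {}           # cached translations: vpn -> page contents

    def map_page(self, vpn: int, page: bytes, shared: bool = False) -> None:
        self.page_table[vpn] = (page, shared)

    def read(self, addr: int) -> int:
        vpn, off = divmod(addr, PAGE_SIZE)
        if vpn not in self.tlb:             # TLB miss: walk the page table
            self.tlb[vpn] = self.page_table[vpn][0]
        return self.tlb[vpn][off]


def insert_breakpoint(mm: AddressSpace, addr: int) -> int:
    """Plant INT3 at `addr` without writing to a page anyone is executing.

    Returns the original byte so the probe layer can keep it."""
    vpn, off = divmod(addr, PAGE_SIZE)
    old_page, shared = mm.page_table[vpn]
    if shared:
        # A write here would seep through to whatever backs the mapping.
        raise PermissionError("MAP_SHARED mapping: not probable")
    new_page = bytearray(old_page)          # CoW: private copy of the page
    original = new_page[off]
    new_page[off] = INT3                    # patch the copy, not the live page
    mm.page_table[vpn] = (bytes(new_page), False)   # switch the PTE
    mm.tlb.pop(vpn, None)                   # flush the stale translation
    return original
```

Threads keep running on the old page until the translation is flushed, so no thread is stopped and no live instruction bytes are rewritten in place.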
Re: linux-next: add utrace tree
On Wed, Jan 27, 2010 at 11:55:16AM +0100, Peter Zijlstra wrote: On Wed, 2010-01-27 at 02:43 -0800, Linus Torvalds wrote: On Wed, 27 Jan 2010, Peter Zijlstra wrote: Right, so you're going to love uprobes, which does exactly that. The current proposal is overwriting the target instruction with an INT3 and injecting an extra vma into the target process's address space containing the original instruction(s) and possible jumps back to the old code stream. Just out of interest, how does it handle the threading issue? Last I saw, at least some CPU people were _very_ nervous about overwriting instructions if another CPU might be just about to execute them. Even the overwrite only the first byte with 'int3' made them go umm, I need to talk to some core CPU people to see if that's ok. They mumble about possible CPU errata, I$ coherency, instruction retry etc. I realize kprobes does this very thing, but kprobes is esoteric stuff and doesn't have much choice. In user space, you _could_ do the modification on a different physical page and then just switch the page table entry instead, and not get into the whole D$/I$ coherency thing at all. Right, so there's two aspects: 1) concurrency when inserting the probe 2) concurrency when hitting the probe 1) used to be dealt with by using utrace to stop all threads in the process and then writing the instruction. I suggested to CoW the page, modify the instruction, set the pagetable and flush tlbs at full speed -- the very thing you suggest here. 2) so traditionally (and the intel arch manual describes this) is to replace the instruction, single step it, and write the probe back. This is racy for multi-threading. The current uprobes stuff solves this by doing single-step-out-of-line (XOL). XOL injects a new vma into the target process and puts the old instruction there, then it single steps on the new location, leaving the original site with INT3. 
This doesn't work for things like RIP relative instructions, so uprobes considers them un-probable. Probing RIP-relative instructions work just fine; there are fixups that take care of it. Also, I myself really object to inserting a vma in a running process, it's like a landlord, sure he has the key but he won't come in and poke through your things. The alternative is to place the instruction in TLS or stack space, since each thread can only have a single trap at a time, you only need space for 1 instruction (plus a possible jump out to the original site). There is the 'problem' of marking the TLS/stack executable when being probed. Then there is the whole emulation angle, the uprobes people basically say it's too much effort to write an x86 emulator. We don't need to write one. I don't know how easy it is to make the kvm emulator less kvm-centric (vcpus, kvm_context, etc). Avi? Ananth
Re: linux-next: add utrace tree
On Wed, 27 Jan 2010, Peter Zijlstra wrote: Right, so there's two aspects: 1) concurrency when inserting the probe That's the one I worried about. Stopping all threads will fix it, obviously at a disastrous performance cost, but what do I care? As noted, there are ways to do it safely with TLB switching, so it's fixable. 2) concurrency when hitting the probe Yeah, I didn't worry about this part, since the only solution is the out-of-line one, and I don't much care how the memory gets allocated for it. Inserting a whole new vma seems pretty drastic, but compared to stopping all threads, it's a small thing. Linus
Re: linux-next: add utrace tree
On Wed, 2010-01-27 at 16:35 +0530, Ananth N Mavinakayanahalli wrote: Probing RIP-relative instructions work just fine; there are fixups that take care of it. Ah my bad then, it was my understanding you simply bailed on those. Just for my information, how large are the replacement sequences?
Re: linux-next: add utrace tree
On Wed, Jan 27, 2010 at 12:08:31PM +0100, Peter Zijlstra wrote: On Wed, 2010-01-27 at 16:35 +0530, Ananth N Mavinakayanahalli wrote: Probing RIP-relative instructions work just fine; there are fixups that take care of it. Ah my bad then, it was my understanding you simply bailed on those. Just for my information, how large are the replacement sequences? The RIP relative instruction is transformed into indirect addressing mode using a scratch register. For details http://marc.info/?l=linux-kernel&m=126401936114639&w=2. Ananth
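One way to picture "transformed into indirect addressing mode using a scratch register" is a ModRM rewrite that keeps the instruction length unchanged. The Python sketch below is an illustration only and may differ from the fixup in the posted patches (the linked post has the actual details). Assumptions: the caller already knows the offset of the ModRM byte, the scratch register is rax (0) or rcx (1), no REX.B bit is set, and the probe layer loads the scratch register with the address of the next original instruction before stepping and restores it afterwards. `rewrite_rip_relative` and `is_rip_relative` are hypothetical names.

```python
import struct


def is_rip_relative(modrm: int) -> bool:
    """In 64-bit mode, mod == 00 with rm == 101 means [rip + disp32]."""
    return (modrm & 0xC7) == 0x05


def rewrite_rip_relative(insn: bytes, modrm_off: int, scratch: int = 0):
    """Return (new_insn, disp32); disp32 is None if nothing was rewritten.

    The rewritten instruction addresses [scratch + disp32], so it computes the
    same effective address once `scratch` holds the original next-insn address."""
    modrm = insn[modrm_off]
    if not is_rip_relative(modrm):
        return insn, None
    if scratch not in (0, 1):
        raise ValueError("sketch only supports rax (0) or rcx (1) as scratch")
    reg = modrm & 0x38                      # keep the reg/opcode-extension field
    if (reg >> 3) == scratch:
        # Conservative check; it ignores REX.R, which would select r8..r15.
        raise ValueError("scratch register clashes with the reg operand")
    (disp,) = struct.unpack_from("<i", insn, modrm_off + 1)
    out = bytearray(insn)
    out[modrm_off] = 0x80 | reg | scratch   # mod = 10: [scratch + disp32]
    return bytes(out), disp
```

For example, `mov 0x200b00(%rip),%rax` (48 8b 05 00 0b 20 00) becomes `mov 0x200b00(%rcx),%rax` (48 8b 81 00 0b 20 00) when rcx is chosen as the scratch register, which is the same size as the original.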
Re: linux-next: add utrace tree
[ Added Arjan ] On Wed, 2010-01-27 at 02:43 -0800, Linus Torvalds wrote: On Wed, 27 Jan 2010, Peter Zijlstra wrote: Right, so you're going to love uprobes, which does exactly that. The current proposal is overwriting the target instruction with an INT3 and injecting an extra vma into the target process's address space containing the original instruction(s) and possible jumps back to the old code stream. Just out of interest, how does it handle the threading issue? Last I saw, at least some CPU people were _very_ nervous about overwriting instructions if another CPU might be just about to execute them. I think the issue was that ring 0 was never meant to do that, where as, ring 3 does it all the time. Doesn't the dynamic library modify its text? -- Steve Even the overwrite only the first byte with 'int3' made them go umm, I need to talk to some core CPU people to see if that's ok. They mumble about possible CPU errata, I$ coherency, instruction retry etc. I realize kprobes does this very thing, but kprobes is esoteric stuff and doesn't have much choice. In user space, you _could_ do the modification on a different physical page and then just switch the page table entry instead, and not get into the whole D$/I$ coherency thing at all. Linus
Re: linux-next: add utrace tree
On 01/27/2010 02:43 AM, Linus Torvalds wrote: On Wed, 27 Jan 2010, Peter Zijlstra wrote: Right, so you're going to love uprobes, which does exactly that. The current proposal is overwriting the target instruction with an INT3 and injecting an extra vma into the target process's address space containing the original instruction(s) and possible jumps back to the old code stream. Just out of interest, how does it handle the threading issue? Last I saw, at least some CPU people were _very_ nervous about overwriting instructions if another CPU might be just about to execute them. Even the overwrite only the first byte with 'int3' made them go umm, I need to talk to some core CPU people to see if that's ok. They mumble about possible CPU errata, I$ coherency, instruction retry etc. We actually went through a review of that here at Intel. We do not yet have an *official* answer (in order for us to have that we have to have it approved by the architecture committee and published in the SDM), but to the best of our current knowledge (and I'm allowed to say this) the int3 method followed by global IPIs should be safe for modifying *one (atomic) instruction*. This is a specific case of a more general rule, but I don't want to disclose the whole rule until it has been officially approved. I realize kprobes does this very thing, but kprobes is esoteric stuff and doesn't have much choice. In user space, you _could_ do the modification on a different physical page and then just switch the page table entry instead, and not get into the whole D$/I$ coherency thing at all. On the more general rule of interpretation: I'm really concerned about having a bunch of partially-capable x86 interpreters all over the kernel. x86 is *hard* to emulate, and it will only get harder as the architecture evolves. -hpa
Re: linux-next: add utrace tree
On Wed, 2010-01-27 at 09:54 +0100, Ingo Molnar wrote: ... I think the best solution for user probes (by far) is to use a simplified in-kernel instruction emulator for the few common probes instruction. (Kprobes already partially decodes x86 instructions to make it safe to apply accelerated probes and there's other decoding logic in the kernel too.) The design and practical advantages are numerous: - People want to probe their function prologues most of the time ... a single INT3 there will in most cases just hit the initial stack allocation and that's it. Yes, emulating push %ebp would buy us a lot of coverage for a lot of apps on x86 (but see below**). Even there, though, we'd have to address the page fault we'd occasionally get when extending the stack vma. We could get quite good coverage (and very fast emulation) for the common case in not too much code - and much of that code we already have available. No re-trapping, As previously discussed, boosting would also get rid of the single-step trap for most instructions. no extra instruction patching x86_64 rip-relative instructions are the only ones we alter. and complex maintenance of trampolines. - It's as transparent as it gets - no user-space trampoline or other visible state that modifies behavior or can be stomped upon by user-space bugs. The XOL vma isn't writable from user space, so I can't think of how it could be clobbered merely by a stray memory reference. Yes, it's a vma that the unprobed app would never have; and yes, a malicious app or kernel module could remove it or alter the protection and scribble on it. We don't try to defend the app against such malicious attacks, but we do our best to ensure that the kernel side handles such attacks gracefully. - Lightweight and simple probe insertion: no weird setup sequence needing the stopping of all tasks to install the trampoline. We just add the INT3 and off you go. 
FWIW, we don't stop all threads to set up or extend the XOL vma, which is typically a one-time event. We just grab a mutex, in case multiple threads hit previously-unhit probepoints simultaneously, and simultaneously decide that the XOL area needs to be created or extended. - Emulation is evidently thread-safe, SMP-safe, etc. as it only acts on task local state. The posted uprobes implementation is, so far as we can tell through code inspection and testing, also thread-safe and SMP-safe. - The points we can probe are never truly limited as it's all freely upscalable: if you cannot probe an instruction you want to probe today, extend the emulator. I don't see how ripping out existing support for almost* the entire instruction set, and then putting it back instruction by instruction, patch by patch, is a win. Even if we add emulation, it seems sensible to keep the XOL approach as a backup to handle instructions that aren't yet emulated (and architectures that don't yet have emulators). That way, if you don't probe any unemulated instructions, the XOL vma is never created. Deny the rest. _All_ versions of uprobes code i've seen so far already restricts the probe-compatible instruction set: *Yes, we currently decline to probe some instructions that look troublesome and we haven't taken the time to test. These include things like privileged instructions, int*, in*/out*, and instructions that fuss with the segment registers. We've never actually seen such instructions in user apps. RIP-relative instructions are excluded on 64-bit for example. No. As discussed in previous posts, we handle rip-relative instructions. - Emulation has the _least_ semantical side effects as we really execute 'that' instruction - It seems to me that emulation is the only approach that DOESN'T execute the probed instruction. not some other instruction put elsewhere into a special vma or into the process/thread stack, or some special in-kernel trampoline, etc. 
- Emulation can be very fast for the common case as well. Nobody will probe weird, complex instructions. They will use 'perf probe' to insert probes into their functions 90% of the time ... - FPU and complex ops and pagefault emulation is not really what i'd expect to be necessary for simple probing - but it _can_ be added by people who care about it, if they so wish. **In practice, we've had to probe all sorts of instructions, including FP instructions -- especially where you want to exploit the debug info to get the names, types, and locations of variables and args. For some compilers and architectures, the debug info isn't reliable until the end of the function prologue, at which point you could find any old instruction. Ditto if you want to probe statements within a function. Such a scheme would be _far_ more preferable from a maintenance POV as well, as the initial code will be small, and we can extend it gradually. All the