Re: [RFC] [PATCH 0/7] UBP, XOL and Uprobes [ Summary of Comments and actions to be taken ]
Peter Zijlstra wrote: On Fri, 2010-01-22 at 12:32 +0530, Srikar Dronamraju wrote: 2. XOL vma vs Emulation vs Single Stepping Inline vs using Protection Rings. XOL VMA is an additional process address vma. This is opposition to add an additional vma without user actually requesting for the same. XOL vma and single stepping inline are the two architecture independent implementations. While other implementations are more architecture specific. Single stepping inline wouldnt go well with multithreaded process. Even though XOL vma has its own issues, we will go with it since other implementations seem to have more complications. we would look forward to implementing boosters later. Later on, if we come across another techniques with lesser side-effects than the XOL vma, we would switch to using them. How about modifying glibc to reserve like 64 bytes on the TLS structure it has and storing the ins and possible boost jmp there? Since each thread can only have a single trap at any one time that should be enough. Hmm, it is a good idea. Well, we'll have a copy of original insn in kernel, but it could be simpler than managing XOL vma. :-) Thank you, -- Masami Hiramatsu Software Engineer Hitachi Computer Products (America), Inc. Software Solutions Division e-mail: mhira...@redhat.com
Re: [RFC] [PATCH 1/7] User Space Breakpoint Assistance Layer (UBP)
Frederic Weisbecker wrote: On Tue, Jan 19, 2010 at 09:47:45AM -0800, Jim Keniston wrote: Do you have plans for a variant that's completely in userspace? I don't know of any such plans, but I'd be interested to read more of your thoughts here. As I understand it, you've suggested replacing the probed instruction with a jump into an instrumentation vma (the XOL area, or something similar). Masami has demonstrated -- through his djprobes enhancement to kprobes -- that this can be done for many x86 instructions. What does the code in the jumped-to vma do? Is the instrumentation code that corresponds to the uprobe handlers encoded in an ad hoc .so? Once the instrumentation is requested by a process that is not the instrumented one, this looks impossible to set a uprobe without a minimal voluntary collaboration from the instrumented process (events sent through IPC or whatever). So that looks too limited, this is not anymore a true dynamic uprobe. Agreed. Since uprobe's handler must be running in kernel, we need to jump into kernel space anyway. Booster (just skips a single-stepping(trap) exception) may be useful for improving uprobe performance. And also as Andi said, using jump instead of int3 in userspace has 2GB address space limitation. It's not a problem for kernel inside, but a big problem in userspace. Thank you, -- Masami Hiramatsu Software Engineer Hitachi Computer Products (America), Inc. Software Solutions Division e-mail: mhira...@redhat.com
Re: [RFC] [PATCH 1/7] User Space Breakpoint Assistance Layer (UBP)
Jim Keniston wrote: Not really. For #3 (boosting), you need to know everything for #2, plus be able to compute the length of each instruction -- which we can now do for x86. To emulate an instruction (#4), you need to replicate what it does, side-effects and all. The x86 instruction set seems to be adding new floating-point instructions all the time, and I bet even Masami doesn't know what they all do, but so far, they all seem to adhere to the instruction-length rules encoded in Masami's instruction decoder. Actually, current x86 decoder doesn't support FP(x87) instructions.(even it already supported AVX) But I think it's not so hard to add it. Thank you, -- Masami Hiramatsu Software Engineer Hitachi Computer Products (America), Inc. Software Solutions Division e-mail: mhira...@redhat.com
Re: [RFC] [PATCH 1/7] User Space Breakpoint Assistance Layer (UBP)
Jim Keniston wrote: On Mon, 2010-01-18 at 10:58 -0500, Masami Hiramatsu wrote: Jim Keniston wrote: Not really. For #3 (boosting), you need to know everything for #2, plus be able to compute the length of each instruction -- which we can now do for x86. To emulate an instruction (#4), you need to replicate what it does, side-effects and all. The x86 instruction set seems to be adding new floating-point instructions all the time, and I bet even Masami doesn't know what they all do, but so far, they all seem to adhere to the instruction-length rules encoded in Masami's instruction decoder. Actually, current x86 decoder doesn't support FP(x87) instructions.(even it already supported AVX) But I think it's not so hard to add it. At one point I verified that it worked for all the x87 instructions in libm: https://www.redhat.com/archives/utrace-devel/2009-March/msg00031.html I'm pretty sure I tested mmx instructions as well. But I guess this was before you rearranged the opcode tables. Yeah, it wouldn't be hard to add back in, at least for purposes of computing instruction lengths. objdump -d /lib/libm.so.6 | awk -f arch/x86/tools/distill.awk | ./test_get_len Succeed: decoded and checked 37198 instructions Hmm, yeah, that's already supported :-D. Thank you, -- Masami Hiramatsu Software Engineer Hitachi Computer Products (America), Inc. Software Solutions Division e-mail: mhira...@redhat.com
Re: [RFC] [PATCH 7/7] Ftrace plugin for Uprobes
Steven Rostedt wrote: On Tue, 2010-01-12 at 05:54 +0100, Frederic Weisbecker wrote: Now what if I want to launch ls and want to profile a function inside. What can I do with a trace event. I can't create the probe event based on a pid as I don't know it in advance. I could give it the ls cmdline and it manages to activate on the next ls launched. This is racy as another ls can be launched concurrently. You make a wrapper script: #!/bin/sh add probe to ls with pid $$ exec $* I do this all the time to limit the function tracer to a specific app. #!/bin/sh echo $$ /debug/tracing/set_ftrace_pid echo function /debug/tracing/current_tracer exec $* I recommend you to add below line at the end of the script, from my experience. :) echo nop /debug/tracing/current_tracer Thank you, -- Masami Hiramatsu Software Engineer Hitachi Computer Products (America), Inc. Software Solutions Division e-mail: mhira...@redhat.com
Re: [RFC] [PATCH 7/7] Ftrace plugin for Uprobes
Masami Hiramatsu wrote: Steven Rostedt wrote: On Tue, 2010-01-12 at 05:54 +0100, Frederic Weisbecker wrote: Now what if I want to launch ls and want to profile a function inside. What can I do with a trace event. I can't create the probe event based on a pid as I don't know it in advance. I could give it the ls cmdline and it manages to activate on the next ls launched. This is racy as another ls can be launched concurrently. You make a wrapper script: #!/bin/sh add probe to ls with pid $$ exec $* I do this all the time to limit the function tracer to a specific app. #!/bin/sh echo $$ /debug/tracing/set_ftrace_pid echo function /debug/tracing/current_tracer exec $* I recommend you to add below line at the end of the script, from my experience. :) echo nop /debug/tracing/current_tracer Oops, my bad, it doesn't work after exec... But, it is very important to disable function tracer after tracing target process. So, perhaps, below script may work. #!/bin/sh (echo $BASHPID /debug/tracing/set_ftrace_pid echo function /debug/tracing/current_tracer exec $*) echo nop /debug/tracing/current_tracer Thanks, -- Masami Hiramatsu Software Engineer Hitachi Computer Products (America), Inc. Software Solutions Division e-mail: mhira...@redhat.com
Re: instruction-analysis API(s)
Hi Jim, Jim Keniston wrote: On Fri, 2009-02-27 at 16:20 -0500, Masami Hiramatsu wrote: ... Here are a patch against your code and an example code for instruction length decoder. Curiously, KVM's instruction decoder does not completely cover all instructions(especially, Jcc/test...). I had to refer Intel manuals. Moreover, even with this patch, the decoder is incomplete. - this doesn't cover 3bytes opcode yet. - this doesn't decode sib, displacement and immediate. - might have some bugs :-( Thank you, Thanks for your work on this. Comments below. Thank you very much for review! Actually, that code was based on KVM code, so I also found many opcodes were not supported. As I mentioned in private email, you or I should probably refactor this into: - insn_get_sib() - insn_get_displacement() - insn_get_immediate() - insn_get_length() Agreed, these should be supported. I also would like to change struct insn as below; struct insn { struct insn_field prefixes; /* prefixes.value is a bitmap */ struct insn_field opcode; /* opcode.bytes[n] == opcode_n */ struct insn_field modrm; struct insn_field sib; struct insn_field displacement; union { struct insn_field immediate; struct insn_field moffset1; /* for 64bit MOV */ struct insn_field immediate1; /* for 64bit imm or off16/32 */ }; union { struct insn_field moffset2; /* for 64bit MOV */ struct insn_field immediate2; /* for 64bit imm or seg16 */ }; u8 opnd_bytes; u8 addr_bytes; u8 length; bool x86_64; const u8 *kaddr;/* kernel address of insn (copy) to analyze */ const u8 *next_byte; }; opcode2 and opcode3 will be stored in opcode.value with opcode1. Now, I'm updating my code. Would anyone also be working on it? Thank you, Jim plain text document attachment (insn_x86.patch) Index: insn_x86.h === --- insn_x86.h (revision 1510) +++ insn_x86.h (working copy) @@ -66,6 +66,10 @@ struct insn_field displacement; struct insn_field immediate; +u8 op_bytes; I'd probably use opnd_bytes and addr_bytes here, for clarity. (When I first saw op, I thought opcode.) Also, we should clarify that these are the EFFECTIVE lengths, not the lengths of the immediate and displacement fields in the instruction. +u8 ad_bytes; +u8 length; + const u8 *kaddr;/* kernel address of insn (copy) to analyze */ const u8 *next_byte; bool x86_64; @@ -75,6 +79,7 @@ extern void insn_get_prefixes(struct insn *insn); extern void insn_get_opcode(struct insn *insn); extern void insn_get_modrm(struct insn *insn); +extern void insn_get_length(struct insn *insn); #ifdef CONFIG_X86_64 extern bool insn_rip_relative(struct insn *insn); Index: insn_x86.c === --- insn_x86.c (revision 1510) +++ insn_x86.c (working copy) @@ -17,7 +17,7 @@ * * Copyright (C) IBM Corporation, 2002, 2004, 2009 */ - +#include linux/module.h #include linux/string.h // #include asm/insn.h #include insn_x86.h @@ -34,6 +34,11 @@ insn-kaddr = kaddr; insn-next_byte = kaddr; insn-x86_64 = x86_64; +insn-op_bytes = 4; +if (x86_64) +insn-ad_bytes = 8; +else +insn-ad_bytes = 4; } EXPORT_SYMBOL_GPL(insn_init); @@ -79,10 +84,51 @@ break; prefixes-value |= pfx; } +if (prefixes-value X86_PFX_OPNDSZ) { +/* oprand size switches 2/4 */ +insn-op_bytes ^= 6; +} +if (prefixes-value X86_PFX_ADDRSZ) { +/* address size switches 2/4 or 4/8 */ +#ifdef CONFIG_X86_64 +if (insn-x86_64) +insn-op_bytes ^= 12; +else +#endif +insn-op_bytes ^= 6; This seems wrong. You're checking the address-size prefix, but adjusting the operand size. +} +#ifdef CONFIG_X86_64 +if (prefixes-value X86_PFX_REXW) +insn-op_bytes = 8; +#endif prefixes-got = true; } EXPORT_SYMBOL_GPL(insn_get_prefixes); +static bool __insn_is_stack(struct insn *insn) It's not entirely clear to me what this function checks. (A more precise name might help.) You have pushes, pops, and calls here, but you also have some instructions that don't appear to affect the stack at all. And other push and pop instructions are missing. +{ +u8 reg; +if (insn-opcode.nbytes == 2) +return 0; The following are 2-byte pushes or pops: 0f-a0, 0f-a1, 0f-a8, and 0f-a9. Also, since the return value is bool, I'd prefer to see true/false rather than 1/0. + +switch(insn-opcode1) { +case 0x68
Re: instruction-analysis API(s)
Jim Keniston wrote: Hi, Masami. Ananth, Srikar, Maneesh, and I talked about this last night. While I was on vacation, Srikar did further investigation into adapting x86 kvm's instuction analysis for more general use, and he's not optimistic. For the short term, at least (i.e., between now and the Linux Foundation Collaboration Summit in April), we're going to proceed based on the prototype I developed. As you noted, djprobes needs instruction lengths, and my prototype doesn't provide that info. (Uprobes computes instruction lengths for rip-relative x86_64 instructions, but that's only a subset of what you need.) Are you interested in extending/enhancing my prototype to make it useful for djprobes? If so, I'd be happy to consult. Hi Jim, Thank you for considering djprobe. Actually, I'm developing insn_get_length() based on your prototype and porting djprobe on it. After tested code, I'd like to post the insn_x86 code. Thank you, Thanks. Jim -- Masami Hiramatsu Software Engineer Hitachi Computer Products (America) Inc. Software Solutions Division e-mail: mhira...@redhat.com
Re: global tracing
Hi, Roland McGrath wrote: Actually, this point is where I'm stuck on these weeks. If we add marker or tracepoint to trace every syscalls, we might have to put it in the tracehook or audit and set TIF_SYSCALL_TRACE for every process, or put tracepoint in the syscall entrance/exit asm-code and check another flag. Since latter adds additional flag-checking in fast-path, I think it is not acceptable. I agree completely that it would be wrong to do any new arch work for this, especially assembly hacking. Certainly piggy-backing on the existing TIF_SYSCALL_TRACE or TIF_SYSCALL_AUDIT in some fashion is the way to go. If you don't need complete user register access at your tracepoint, then the audit path is an option. I suspect you do, and so TIF_SYSCALL_TRACE is what to use. Then you can put tracepoints in tracehook_report_syscall_*. Actually, I did it and found it is not simple to hook audit syscall. It seems that audit flag is not synchronously cleared/set on processes with audit_context. I think tracehook is better and simpler way to do that. But there is still some audit-related problem when I set TIF_SYSCALL_TRACE flag on every process, and I'm investigating that. Maybe I need to improve syscall audit. It's straightforward to write a loop to set TIF_SYSCALL_TRACE on every task. The only wrinkle is dealing with clearing the flag correctly. You don't need a loop, because it can be cleared lazily by each thread when it gets into the slow path and finds it has no reason to be there. This is not very hard. It only requires adding a few lines in the utrace code to check your global-syscall-trace flag in deciding when to clear TIF_SYSCALL_TRACE. That's a good idea. I'll check that. This would be unlike a plain tracepoint only in that you have to make this explicit call to switch it on and off. (Maybe this could be rolled into the tracepoint probe registration API.) Sure, even though, we can enable it when initializing tracepoint-marker conversion module. I'm not at all arguing against having utrace global tracing to provide you this feature instead. (I already raised the pros/cons about that generally and that discussion can continue.) But this is how you'd do it sensibly with tracepoints IMHO. (The details I just described are not much different from what utrace global tracing would have for handling TIF_SYSCALL_TRACE.) I agree that. I think if I can set TIF_SYSCALL_TRACE on each process safely, it can work with utrace global tracing too. In that case, I can move to utrace global tracing feature. Thank you, Thanks, Roland -- Masami Hiramatsu Software Engineer Hitachi Computer Products (America) Inc. Software Solutions Division e-mail: [EMAIL PROTECTED]