To be notified of breakpoint and single-step traps, uprobes currently uses a utrace report_signal callback. Per the data below, we may need to intercept the trap before it turns into a signal -- preferably via a new utrace event callback. I think this is already on Roland's TODO list (and was discussed on a July 23, 2007 conference call). Anyway, here's some motivation.
Jim Keniston

-------- Forwarded Message --------
From: jkenisto at us dot ibm dot com <[EMAIL PROTECTED]>
Reply-To: [EMAIL PROTECTED]
To: [EMAIL PROTECTED]
Subject: [Bug uprobes/5660] New: uprobed multithreaded app serializes in signal-handling code
Date: 23 Jan 2008 00:49:07 -0000

Uprobing a multithreaded app on an x86_64 SMP system shows serious serialization of the threads in the kernel's signal-handling code. In the app in question, the child threads just call a dummy function repeatedly; the uprobes module probes the dummy function's entry point.

Here's a summary of data reported by oprofile. It shows that with more than one thread running, utrace_get_signal(), get_signal_to_deliver(), and force_sig_info() are the top three consumers of CPU time. I'm guessing that the threads are serializing on task_struct->sighand->siglock (which is shared among tasks of the same process).

#CPUs: 4

                       pct (rank)         pct (rank)             pct (rank)
threads  usec/iter**   utrace_get_signal  get_signal_to_deliver  force_sig_info
1*        4.4          12.2% (1)           2.4% (13)             < 1%
1         4.0          12.0% (1)           3.5% (7)              < 1%
2         9.2          21.4% (1)          13.2% (2)               5.7% (3)
3        19.0          30.9% (1)          24.4% (2)              13.5% (3)
4        29.7          36.7% (1)          25.6% (2)              14.4% (3)

*  single-thread program -- no parent thread
** Divide by #threads to get usec per probe hit.
Percentages are of total kernel+user time.

I have no particular reason to think that this problem is specific to x86_64.

I've observed poor scaling on multithreaded apps before, but never got around to pointing oprofile at it. I was hoping it was something we could fix in uprobes. :-|
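
For anyone who wants to reproduce the shape of the workload, here is a minimal sketch of the kind of test app described above: child threads that do nothing but call a dummy function in a loop. This is not the actual test program from the bug report. The names dummy(), worker(), and run_threads() are made up for illustration, and the noinline attribute assumes GCC or Clang. The probe itself would be placed on dummy()'s entry point by a separate uprobes module, which is not shown.

```c
#include <pthread.h>
#include <stdlib.h>

/* Hypothetical probe target: a uprobe would sit on this function's entry.
 * noinline (GCC/Clang-specific, an assumption here) keeps it a real call. */
__attribute__((noinline)) void dummy(void)
{
    /* Empty asm statement so the compiler does not discard the call. */
    __asm__ volatile("");
}

struct worker_arg {
    long iters;   /* number of calls this thread should make */
    long done;    /* number of calls actually made */
};

/* Child thread body: call dummy() repeatedly and count the calls. */
static void *worker(void *p)
{
    struct worker_arg *arg = p;

    for (long i = 0; i < arg->iters; i++) {
        dummy();
        arg->done++;
    }
    return NULL;
}

/* Start nthreads child threads, wait for all of them, and return the
 * total number of dummy() calls, or -1 on error. */
long run_threads(int nthreads, long iters)
{
    long total = -1;
    int started = 0;
    pthread_t *tids;
    struct worker_arg *args;

    if (nthreads <= 0)
        return -1;

    tids = calloc(nthreads, sizeof(*tids));
    args = calloc(nthreads, sizeof(*args));   /* done starts at 0 */
    if (!tids || !args)
        goto out;

    for (; started < nthreads; started++) {
        args[started].iters = iters;
        if (pthread_create(&tids[started], NULL, worker, &args[started]) != 0)
            break;
    }

    /* Join whatever was started, even if a later create failed. */
    for (int i = 0; i < started; i++)
        pthread_join(tids[i], NULL);

    if (started == nthreads) {
        total = 0;
        for (int i = 0; i < nthreads; i++)
            total += args[i].done;
    }
out:
    free(tids);
    free(args);
    return total;
}
```

To use it, call something like run_threads(4, 1000000) from main() and build with -pthread. Timing the run and dividing by the iteration count would give a usec/iter figure comparable in spirit to the table above, though the numbers will depend entirely on the machine and on the probe handler.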