This is my first post here. Apologies in advance if I don't get the etiquette quite right.
Background ========== We have an application which is exhibiting very large performance problems under some loads. We noticed extremely high context switch rates under the load in question. Rates can be up to ~1.2e6 switches / second (as reported by sar -w) on a 4 physical core hyper-threaded Xeon 3.4GHz machine. While we don't yet know if this is the root cause of the performance problems, we'd very much like to understand what code path is triggering this behavior. Our theory is that the context switching is caused by a few very heavily contended futuxes, but we need to develop evidence to support or disprove this theory. We are trying to use 'perf' to understand what's going on. We have been using the following command to sample the call graph as the load runs : perf record e cs g ‹ <load generator tool> perf report -G We're specifically interested in the context switching behavior, not profiling the application to see where the time is going (we already know where our hot spot is here). NB: I have also used 'perf record -e cs -g -a -- sleep 2', while the application is running as a separate process. The results appear to be similar, except that data for all running processes is included in the report (as you would expect). The problem =========== We have recompiled the production system so it's statically linked, not optimized and has the frame pointers included in the image. The problem is that 'perf report' is generating good symbols for the kernel but the contended symbols we're interested in are showing up as hex addresses. Additionally, the hex values are very low (e.g. 0x14). We could manually resolve these hex addresses, but I checked the contents of the ELF image and there is no data that low in the address space. This makes us suspect that the address data may be incorrect. Example code ============ Since the production system is large, not always well understood and profiling it is giving unexpected results, I thought I'd check the behavior of 'perf' with much simpler code. I tweaked a bit of C I lifted off the web so it causes high context switch rates. I am profiling that simple code to understand how the 'perf' toolchain works. In this case, I'm not seeing the suspicious hex addresses. However, I'm also not seeing the symbols in my test code where the futex calls originate - this is what we ultimately want to see. To double check that this isn't just a display issue, I have looked at the contents of the 'perf.data' file with 'perf script'. It does appear that all the symbols are either in kernel space or the pthreads library. The code and compilation commands are attached to the end of this email. Perf report output ================== # captured on: Wed Dec 4 15:55:25 2013 # hostname : XXXXX # os release : 3.10.19-3.10.0-amd64-11777104 # perf version : 3.10.19-3.10.0-amd64-11777104 # arch : x86_64 # nrcpus online : 8 # nrcpus avail : 8 # cpudesc : Intel(R) Xeon(R) CPU E31270 @ 3.40GHz # cpuid : GenuineIntel,6,42,7 # total memory : 16445052 kB # cmdline : /usr/local/bin/perf record -e cs -g -- ../mutex_example # event : name = cs, type = 1, config = 0x3, config1 = 0x0, config2 = 0x0, excl_usr = 0, excl_kern = 0, excl_host = 0, excl_guest = 1, precise_ip = 0 # HEADER_CPU_TOPOLOGY info available, use -I to display # HEADER_NUMA_TOPOLOGY info available, use -I to display # pmu mappings: cpu = 4, software = 1, tracepoint = 2, uncore_cbox_0 = 6, uncore_cbox_1 = 7, uncore_cbox_2 = 8, uncore_cbox_3 = 9, breakpoint = 5 # ======== # # Samples: 21K of event 'cs' # Event count (approx.): 616402 # # Overhead Command Shared Object Symbol # ........ ............. ................. ............ # 100.00% mutex_example [kernel.kallsyms] [k] schedule | |--98.95%-- start_thread | | | |--53.06%-- __lll_mutex_lock_wait | | system_call | | SyS_futex | | do_futex | | futex_wait | | | | | |--56.41%-- futex_wait_setup | | | get_futex_key | | | __lock_page | | | __wait_on_bit_lock | | | sleep_on_page | | | io_schedule | | | schedule | | | | | --43.59%-- futex_wait_queue_me | | schedule | | | |--46.94%-- __lll_mutex_unlock_wake | | system_call | | SyS_futex | | do_futex | | futex_wake | | get_futex_key | | __lock_page | | __wait_on_bit_lock | | sleep_on_page | | io_schedule | | schedule | --0.00%-- [...] | |--0.51%-- __lll_mutex_unlock_wake | | | |--97.17%-- system_call | | schedule | | | --2.83%-- retint_careful | schedule --0.54%-- [...] It may also be worth mentioning that 'perf annotate' shows the disassembly of only the 'schedule' subroutine, with 100% of the events marked against 1 instruction, which seems strange. Environment information ======================= Machine: 4 core, 8 thread Intel Xeon, 3.4GHz, 16GB RAM Kernel version: derivative of 3.10.0 (I'm not a member of the team that makes the changes, so I'm not familiar with what the alterations are.) Base distribution : Ubuntu 8.04 LTS Compiler used: gcc (Ubuntu 4.4.3-4ubuntu5.1) 4.4.3 What I'm asking for =================== I'd like to ask for your advice on how to drill down into the user space, so we can identify exactly where in our code the context switching originates. I was envisaging a 2 step process: 1) Understand why I'm not seeing my test code symbols in the call graph. If I can get the info I want for the test code, then 2) Move on to the production code. I'll need to understand why the hex values seem unusual. It may well be that I'm missing something obvious or misunderstanding what 'perf' is telling me - I'm not terribly experienced with this level of debug. Finally, if it turns out that the 'perf' tools need enhancements, I'm willing to chip in with the development. I'd like to expand my skills, and performance analysis is a big part of my job. Thanks in advance, Nick Pavey =========== =========== Sample code =========== =========== #include <pthread.h> #include <stdio.h> #include <stdlib.h> #include <unistd.h> #define NUM_THREADS 4 #define TCOUNT 100000000 #define COUNT_LIMIT 100000000 int count = 0; pthread_mutex_t count_mutex; pthread_cond_t count_threshold_cv; void *inc_count(void *t) { int i; long my_id = (long)t; printf("Starting inc_count(): thread %ld\n", my_id); for (i=0; i<TCOUNT; i++) { pthread_mutex_lock(&count_mutex); count++; /* Check the value of count and signal waiting thread when condition is reached. Note that this occurs while mutex is locked. */ if (count == COUNT_LIMIT) { pthread_cond_signal(&count_threshold_cv); } pthread_mutex_unlock(&count_mutex); /* NB : I'm deliberately not sleeping here, because I want to increase the context switch rate. In a 'real' application, this is where we would do the time consuming work. sleep(1); */ } pthread_exit(NULL); } void *watch_count(void *t) { long my_id = (long)t; printf("Starting watch_count(): thread %ld\n", my_id); /* Lock mutex and wait for signal. Note that the pthread_cond_wait routine will automatically and atomically unlock mutex while it waits. Also, note that if COUNT_LIMIT is reached before this routine is run by the waiting thread, the loop will be skipped to prevent pthread_cond_wait from never returning. */ pthread_mutex_lock(&count_mutex); while (count<COUNT_LIMIT) { pthread_cond_wait(&count_threshold_cv, &count_mutex); count += 125; } pthread_mutex_unlock(&count_mutex); pthread_exit(NULL); } int main (int argc, char *argv[]) { int i; pthread_t threads[NUM_THREADS]; pthread_attr_t attr; /* Initialize mutex and condition variable objects */ pthread_mutex_init(&count_mutex, NULL); pthread_cond_init (&count_threshold_cv, NULL); /* For portability, explicitly create threads in a joinable state */ pthread_attr_init(&attr); pthread_attr_setdetachstate(&attr, PTHREAD_CREATE_JOINABLE); pthread_create(&threads[0], &attr, watch_count, (void *) 1); /* Note that the counter starts at 1, to ensure we don't overwrite the 'watch_count' thread pointer, which is at index 0 */ for (i=1; i < NUM_THREADS; i++) { pthread_create(&threads[i], &attr, inc_count, (void *) i + 1); } /* Wait for all threads to complete */ for (i = 0; i < NUM_THREADS; i++) { pthread_join(threads[i], NULL); } printf ("Main(): Waited on %d threads. Done.\n", NUM_THREADS); /* Clean up and exit */ pthread_attr_destroy(&attr); pthread_mutex_destroy(&count_mutex); pthread_cond_destroy(&count_threshold_cv); pthread_exit(NULL); } -------------------- Compilation commands -------------------- gcc -c mutex_example.c -ggdb -O0 -fno-omit-frame-pointer \ -Wall -Wextra -Winit-self -Wsign-compare -ansi -pedantic \ -o mutex_example.o gcc -lpthread mutex_example.o -o mutex_example