For a while now, Fennec and Linux-desktop nightlies have had the built-in SPS-based time profiling using native stack unwinding. The unwinding is done using our in-tree Breakpad copy, which includes a bunch of speedup patches that are slowly making their way upstream.
What we have works, but is less than ideal in a number of ways: (1) It's slow. CFI/EXIDX unwinding with Breakpad costs about 6600 instructions/frame, or around 120+K/instructions per stack trace. That makes it infeasible for high-frequency, low overhead profiling. By comparison, Valgrind's CFI based unwinder costs around 220 instructions per frame -- that's 30 times fewer. This posting https://blog.mozilla.org/jseward/2013/08/29/how-fast-can-cfiexidx-based-stack-unwinding-be/ looks at the reasons behind the discrepancy. In summary, Breakpad is architected to be general, using inheritance (to handle multiple architectures) and std::map (to handle arbitrary register sets), both of which slow it down a lot. It also does a bunch of dynamic memory allocation which is difficult to get rid of in some corner cases (obscure Win32 unwinding hacks) (2) It has a prohibitively large footprint for Fennec and in particular for FirefoxOS. My rough measurements show it taking about 100MB of memory to load the CFI unwind rules for libxul.so on ARM. (3) Breakpad is not multithreaded. Currently SPS can sample multiple threads, but those samples are processed (unwound) by a single unwinder thread. It would be clearly a big throughput advantage to be able to throw multiple cores at the unwinding workload. (4) (not directly related to Breakpad, but..) Current SPS samples threads at some constant rate regardless of whether they are running or blocked in syscalls. The latter constitutes a big waste of compute resources, since most threads are blocked in syscalls most of the time. It would be nice to devise an adaptive strategy that samples blocked threads less often. My feeling is that re-architecting Breakpad to fix (1), (2) and (3) is going to be a difficult and messy exercise, which will essentially throwing away the core of it and starting over. Getting changes on such a scale back upstream could be problematic too. I am very tempted to create a new custom unwind library designed specifically to support SPS. It needs to be fast, lower-footprint, and multithreaded. Unlike Breakpad, it will -- at least initially -- avoid supporting all Tier 1 targets, and concentrate on the most critical cases: ARM/FirefoxOS, ARM/Android (via EXIDX) and (as a side effect) {x86,x86_64}/Linux (via CFI). The unwinding core loop and rule representation would be an all-new implementation. Those are the parts that are time and space critical, but they are also relatively simple. The CFI and EXIDX parsers are something that could be imported from Breakpad, since they work, are difficult to get right, are not performance critical, and there's no point in reimplementing them. I am not sure how to deal with (4). Really it's completely unrelated to the unwind mechanism -- it has to do with SPS' sampling policy at the level above. There have been some ideas floated around to detect when a thread is in a syscall by looking at the instruction at the program counter, since the "enter the kernel" instructions are different from all other ones. That doesn't give a way to avoid on average half a long-interval delay of missed samples when a thread leaves a syscall. Comments, opinions? J _______________________________________________ dev-platform mailing list dev-platform@lists.mozilla.org https://lists.mozilla.org/listinfo/dev-platform