http://d.puremagic.com/issues/show_bug.cgi?id=3742
Summary: Please add support for 'Lightweight Profiling', which adds a set of user-controlled counters to the AMD64 architecture
Product: D
Version: future
Platform: x86_64
OS/Version: All
Status: NEW
Severity: enhancement
Priority: P2
Component: DMD
AssignedTo: nob...@puremagic.com
ReportedBy: nick.barbal...@gmail.com

--- Comment #0 from nick barbalich <nick.barbal...@gmail.com> 2010-01-25 16:44:15 PST ---

Late in 2007, AMD announced Lightweight Profiling as a proposed extension to the AMD64 architecture that would allow an application to gather performance statistics about itself with low overhead. We [AMD] posted the preliminary specification and asked for feedback from the developer community. Much to our delight, many of you responded with comments, criticisms, and suggestions on the proposal. We've read all of your feedback, and last week we posted the current version of the LWP specification. The announcement and the link to the spec are here. Thanks to all of you who helped us out.

What came before...

It's important to be able to measure the details of a program's performance in order to find ways to speed it up. Until now, there have been just two ways to do this.

The first is via instrumentation, i.e., adding code to the program to watch the clock or the cycle counter, or just to count the number of times an instruction or loop is executed. Instrumentation can be added by the programmer or by a compiler. Unfortunately, it seriously perturbs the application, and the instrumented code usually doesn't have the same characteristics as the original code, especially when dealing with the data and instruction caches. Also, instrumentation can't observe the hardware caches, so it can't gather data about cache behavior.

The second traditional method of monitoring performance is to use the hardware performance counters. These count hardware events and generate an interrupt after a programmed number of events have happened.
The counters can report on events that are too hard to instrument (like counting each x86 instruction) or are not visible to software (like cache misses). These counters are used by the AMD CodeAnalyst Performance Analyzer and provide deep insight into application and system performance. However, each time a data sample is gathered, the processor must take an interrupt to a kernel-mode driver, and that takes hundreds or thousands of cycles. The driver, by simply executing, changes the contents of the data cache and the instruction cache and may perturb the application's performance. The counters can only be configured, started, and stopped from kernel mode, so an application must call a driver or the operating system to control them. Finally, some systems do not context-switch the performance counters when changing threads or processes, and on those systems performance monitoring can only be done globally, by a single user at a time.

Introducing LWP

After reading about current technology, you might think that an ideal performance monitor should:

* Operate entirely in user mode
* Cause little or no perturbation of the application
* Be controlled separately for each thread
* Have low overhead to allow for higher sampling rates

And that describes LWP! Lightweight Profiling adds a set of user-controlled counters to the AMD64 architecture. They can monitor multiple events simultaneously. An application thread starts profiling by providing the address of an LWP control block (LWPCB) as the operand to the new LLWPCB instruction. The contents of the LWPCB specify which events to count and how often to count them. The LWPCB also points to a ring buffer in the application's memory into which the hardware will store event records. That's it. Once started, LWP counts the specified events. When an event counter underflows, it stores an event record at the head of the ring buffer and resets the counter.
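The count-and-record flow just described can be modeled in software. The following Python sketch is purely illustrative: the class and field names (`interval`, `head`, and so on) are my own, not the LWPCB layout from AMD's specification, and real LWP performs all of this in hardware with no software in the loop.

```python
class LwpCounterModel:
    """Illustrative software model of one LWP event counter and its ring buffer.

    Real LWP does this in hardware; the names here are invented for the sketch
    and simplify underflow to "counter reaches zero".
    """

    def __init__(self, event_id, interval, ring_size):
        self.event_id = event_id
        self.interval = interval        # how many events between samples
        self.counter = interval         # decremented on each event
        self.ring = [None] * ring_size  # event records live in user memory
        self.head = 0                   # hardware stores new records at head

    def event(self, instr_addr):
        """Count one event; when the counter runs out, store a record and reset."""
        self.counter -= 1
        if self.counter == 0:
            # The record carries the event type and the causing instruction.
            self.ring[self.head] = (self.event_id, instr_addr)
            self.head = (self.head + 1) % len(self.ring)
            self.counter = self.interval

# Sample every 1000th "instruction retired"; addresses are made up.
lwp = LwpCounterModel(event_id=1, interval=1000, ring_size=4096)
for addr in range(5000):
    lwp.event(0x400000 + addr)
# 5000 events at an interval of 1000 yield 5 records in the ring.
```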
(If requested, LWP randomizes the bottom bits of the new counter value to prevent "beating" against constant-length loops.) LWP stores the record without interrupting the flow of the program, so the only perturbation to the program's performance is writing the record (usually affecting only a single data cache line) and a few cycles to perform the write. The record contains the event type, the address of the instruction that caused the underflow, and other information about the event. All event types share one ring buffer and can be sorted out by the event type field in the record.

Of course, eventually the buffer will fill up. What then? Well, a program has two options for emptying the ring buffer.

First, it can simply poll the buffer and remove event records from the tail of the ring. When software rewrites the tail pointer, the LWP hardware knows it can reuse the newly emptied region of the ring buffer. Since the buffer is in user memory, the program can even share the memory with another process, and that second process can be responsible for draining the buffer.

Second, the application can specify that it wants LWP to generate an interrupt when the ring buffer is filled past a certain threshold. For instance, it can configure a buffer to hold 10,000 event records and tell LWP to interrupt whenever there are more than 9,000 records in the buffer. The interrupt does indeed perturb the program, but it does so 1/9,000th as often as the traditional performance counters would. Better still, since the buffer is in user memory, the application can catch the interrupt and do whatever it wants with the data. It can store it to disk for later analysis, or it can process it immediately and even try to fix performance problems as they are happening.

In addition, LWP is a per-thread feature. Each thread on the system can be monitoring different events at different rates without interference.
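The two drain strategies above can be sketched as follows. This is an illustrative Python model using simple head/tail index arithmetic on a fixed-size ring; it is not the pointer format the LWP hardware actually uses, and the function names are my own.

```python
def drain(ring, head, tail):
    """Polling drain: consume records from tail up to (not including) head.

    Returning the new tail models software rewriting the tail pointer,
    which is how the hardware learns the emptied region is reusable.
    """
    records = []
    while tail != head:
        records.append(ring[tail])
        tail = (tail + 1) % len(ring)
    return records, tail

def over_threshold(head, tail, size, threshold):
    """Model of the interrupt condition: more than `threshold` records queued."""
    return (head - tail) % size > threshold

# Stand-in event records; the hardware has written up to index 5,
# software last drained up to index 2.
ring = ["r%d" % i for i in range(8)]
records, tail = drain(ring, head=5, tail=2)
# records == ["r2", "r3", "r4"], and tail has caught up to head (5).
```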
If a thread is not using LWP, there is no impact on its performance even if other threads have LWP active.

Some LWP Details

The LWP events are a small subset of the events available in the traditional performance counters. They include Instructions Retired, Branches Retired, and DCache Misses. The Branches Retired event can be filtered by whether the branch is direct or indirect, conditional or unconditional, or other criteria. It captures the target address of the branch, a useful value when looking at indirect branches. The DCache Misses event can be filtered by cache level to capture only "expensive" cache misses.

One exciting feature of LWP is the ability to insert events into the ring buffer under program control. There are two new instructions to do this:

* LWPINS inserts a record into the ring buffer containing data taken from the arguments to the instruction. A program can use LWPINS to insert a marker to indicate an important event, such as loading or unloading a shared library, that influences the way addresses should be interpreted in subsequent event records.

* LWPVAL uses an event counter and decrements the counter each time it is executed, much the way the hardware event counters work. When the counter underflows, it inserts a record into the ring buffer containing data from its arguments. A program uses LWPVAL to implement a technique called value profiling. For instance, it can profile the divisor of a commonly executed DIV instruction, and if the data show that the divisor is frequently the same number, it can rewrite the instruction to test for that value and execute an optimized code sequence. Similarly, it can profile the target of a hot indirect branch and generate better code if one target of the branch is dominant.

Who will use LWP?

LWP can be used in many different application environments. These include:

* Managed Runtime Environment: Managed Runtimes (MRTEs) are programming environments such as Java and the Microsoft .NET Framework.
These environments have the ability to generate AMD x86 or x64 code for routines coded in a high-level managed language (such as Java or C#), and they can do that on the fly as a program is running. The MRTE can enable LWP and periodically look for performance problems. If (when!) it finds them, it can generate better code for the hot spots and improve the program's overall performance. LWP is lightweight enough that it can run continuously.

* Dynamic Optimizer: A Dynamic Optimizer is a program that monitors an application and attempts to improve its performance by modifying it as it runs. In this case, the target application is compiled to native code from a traditional language like C or C++. The Dynamic Optimizer can gather performance data without affecting the flow of control in the application.

* Compiler Feedback: Most modern compilers have an option to build an instrumented program which the developer runs to gather information on the program's performance. Unfortunately, the added instrumentation (and the fact that optimization levels are often cranked down in a feedback compilation) perturbs the program so much that what's being measured is substantially different from the "real" program. With LWP, the compiler can gather statistics on the program's execution without changing it, and it can insert LWPVAL instructions to profile interesting areas without adding a large block of instrumentation code and without clobbering any registers. If the application runs without turning on LWP, the LWPVAL instructions act as NOPs and take only a few cycles.
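The value-profiling pattern mentioned under LWPVAL above (profile a hot DIV's divisor, then specialize for the dominant value) can be illustrated in software. In this hypothetical Python sketch, the sample list plays the role of the divisor values an LWPVAL record stream would deliver; the function names and the 90% dominance cutoff are my own choices, not anything defined by the LWP specification.

```python
from collections import Counter

def dominant_divisor(samples):
    """Return the divisor value if one accounts for >90% of samples, else None."""
    value, hits = Counter(samples).most_common(1)[0]
    return value if hits / len(samples) > 0.9 else None

def make_divider(samples):
    """'Rewrite' the division based on profile data: guard for the hot
    divisor's fast path, falling back to the generic DIV otherwise."""
    hot = dominant_divisor(samples)
    if hot is None:
        return lambda x, d: x // d      # no dominant value: generic path only
    def divide(x, d):
        if d == hot:                    # cheap guard for the common case,
            return x // hot             # where a specialized sequence (e.g. a
        return x // d                   # shift for a power of two) could go
    return divide

# 95% of the profiled divisors were 8, so the guarded path is worthwhile.
samples = [8] * 95 + [3] * 5
div = make_divider(samples)
```

A real dynamic optimizer would patch machine code rather than build a closure, but the decision logic (is one value dominant enough to pay for a guard?) is the same.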
Note: the above has been taken from:
http://forums.amd.com/devblog/blogpost.cfm?catid=208&threadid=116487&utm_source=feedburner&utm_medium=feed&utm_campaign=Feed%3A+AmdDeveloperBlogs+%28AMD+Developer+Blogs%29

The latest revision of the Lightweight Profiling specification document (v3.03) contains updates that are a direct result of AMD community feedback, and can be found here:
http://support.amd.com/us/Processor_TechDocs/43724.pdf