* Alexey Dobriyan ([EMAIL PROTECTED]) wrote:
> On Sun, Aug 12, 2007 at 11:07:04AM -0400, Mathieu Desnoyers wrote:
> > Immediate values are used as read-mostly variables that are rarely
> > updated. They use code patching to modify the values inscribed in the
> > instruction stream. This provides a way to save precious cache lines
> > that would otherwise have to be used by these variables.
> >
> > There is a generic _immediate_read() version, which uses standard
> > global variables, and optimized per-architecture immediate_read()
> > implementations, which use a load immediate to remove a data cache
> > hit. When the immediate values functionality is disabled in the
> > kernel, it falls back to global variables.
> >
> > It adds a new rodata section "__immediate" to place the pointers to
> > the enable value. The immediate values activation functions sit in
> > kernel/immediate.c.
> >
> > Immediate values refer to the memory address of a previously declared
> > integer. This integer holds the information about the state of the
> > associated immediate values, and must be accessed through the API
> > found in linux/immediate.h.
> >
> > At module load time, each immediate value is checked to see if it
> > must be enabled. That is the case if the variable it refers to is
> > exported from another module and already enabled.
> >
> > In the early stages of start_kernel(), the immediate values are
> > updated to reflect the state of the variable they refer to.
> >
> > * Why should this be merged *
> >
> > It improves performance on heavy memory I/O workloads.
> >
> > An interesting result shows the potential this infrastructure has:
> > the slowdown a simple system call such as getppid() suffers when it
> > is used under heavy user-space cache thrashing.
> >
> > Random walk L1 and L2 thrashing surrounding a getppid() call:
> > (note: in this test, do_syscall_trace was taken at each system call,
> > see Documentation/immediate.txt in these patches for details)
> > - No memory pressure   : getppid() takes  1573 cycles
> > - With memory pressure : getppid() takes 15589 cycles
> >
> > We therefore have a slowdown of 10 times just to get the kernel
> > variables from memory. Another test on the same architecture
> > (Intel P4) measured the memory latency to be 559 cycles. Therefore,
> > each cache line removed from the hot path would improve the syscall
> > time by 3.5% in these conditions.
>
> I still think this is a bad idea:
> 1) already existing CPU errata against popular models. As I learned
>    yesterday -- SILENT reboot (often) or hang (rare).
I understand your fears. Actually, I had the same reaction when I heard
about the djprobes project. What I am doing with the immediate values is a
much simpler form of live code patching, which limits what has to be
patched to a very simple subset of instructions whose layout (alignment,
etc.) can be controlled. I have taken the errata in question into
consideration and developed algorithms that make sure they won't be
triggered.

I would suggest that you present test cases that prove your fear right, or
cite documentation of the specific CPU models upon which you base your
comment. Reproducing the errata on both a PIII and an Intel Core 2 Duo
seems very hard to achieve. If anyone out there has a "known" test case to
trigger the fault, I would gladly do some more testing to show the
behavior without my algorithm (not working) vs. with my algorithm in the
same conditions.

> 2) new type pretending to be standard and idiomatic -- immediate_t

Giving a distinct type to the data referred to by immediate_read() seems
important to me, because not doing so would result in people updating the
variable directly without using immediate_set(), which, of course, would
not update all the references to the variable. If you have better
suggestions, I am open to them.

> 3) new almost-C-keyword -- immediate_if. What does it mean? You'll never
> guess just by looking at it.

It's a special if() statement that uses its argument (an immediate value)
as the condition for a branch. The reason why it is not written as
if (immediate_read(var)) is that I want to leave room for improvement if
the gcc folks ever want to implement nop/jmp-based branches that could be
patched at runtime.
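To make this a bit more concrete, here is a rough sketch of what the
generic fallback (no code patching) could look like. This is only an
illustration with approximate names, not the actual patch code:

	/*
	 * Illustration only -- approximate names, not the actual patch
	 * code.  The wrapper type exists so that the value cannot be
	 * updated behind immediate_set()'s back, which is what keeps all
	 * the references coherent once the optimized implementation
	 * starts patching the load-immediate instructions.
	 */
	typedef struct { int value; } immediate_int_t;

	/* Generic fallback: a plain memory read of a global variable. */
	#define immediate_read(imm)	((imm).value)

	/*
	 * Generic fallback: a plain store.  The optimized per-architecture
	 * version would, in addition, rewrite the constant embedded in
	 * every load-immediate site emitted for this value.
	 */
	#define immediate_set(imm, val)	do { (imm).value = (val); } while (0)

	/* In the fallback case, immediate_if() degenerates into a plain test. */
	#define immediate_if(imm)	if (immediate_read(imm))

With the optimized implementation, only immediate_set() stays expensive,
which is fine for read-mostly data.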
> 4) numbers! it's very entertaining to look at numbers with 4 digits
> after comma and without any attempts to do error estimates.

Here are the tests redone with error estimates. 10 runs of 100 iterations
each (I don't have more time to do the 10000 iterations I did the first
time, sorry; the statistics will show the errors appropriately). Tests
done on a 3GHz P4. Here I run getppid with syscall trace inactive,
comparing with and without memory pressure (sorry, my system is not set up
to execute syscall_trace this time, but it will make the point anyway).

No memory pressure
  Reading timestamps : 150.92 cycles, std. dev. 1.01 cycles
  getppid            : 1462.09 cycles, std. dev. 18.87 cycles

With memory pressure
  Reading timestamps : 578.22 cycles, std. dev. 269.51 cycles
  getppid            : 17113.33 cycles, std. dev. 1655.92 cycles

Now for the memory read timing (10 runs, branches per test: 100000):
  Memory read based branch  : 644.09 cycles, std. dev. 11.39 cycles
  L1 cache hit based branch : 88.16 cycles, std. dev. 1.35 cycles

So, now that we have the raw results, let's calculate:

Memory read:
  644.09±11.39 - 88.16±1.35 = 555.93±11.46 cycles
Getppid without memory pressure:
  1462.09±18.87 - 150.92±1.01 = 1311.17±18.90 cycles
Getppid with memory pressure:
  17113.33±1655.92 - 578.22±269.51 = 16535.11±1677.71 cycles

Therefore, if we add 2 markers not based on immediate values to the
getppid code, which would add 2 memory reads, we would add
  2 * 555.93±11.46 = 1111.86±22.92 cycles

Therefore:
  1111.86±22.92 / 16535.11±1677.71 = 0.0672
  relative error: sqrt((22.92/1111.86)^2 + (1677.71/16535.11)^2) = 0.1035
  absolute error: 0.1035 * 0.0672 = 0.0070

Therefore: 0.0672±0.0070 * 100% = 6.72±0.70 %

We can therefore affirm that adding 2 markers to getppid, on a system with
high memory pressure, would have a performance hit of at least 6.0% on the
system call time, within the uncertainty limits of these tests. The same
applies to other kernel code paths. The smaller those code paths are, the
higher the relative impact will be.

> 5) examples!
>
> DEFINE_IMMEDIATE_TYPE(struct task_struct*, immediate_task_struct_ptr_t);
> immediate_task_struct_ptr_t myptr;
>
> this is close to hungarian notation, and threats were made to use
> immediate stuff everywhere.

Same reason as 2). Note that the standard case is immediate_int_t,
immediate_long_t..., which is not as bad as having to express the whole
pointer type in the type name. That's the best tradeoff between
restricting direct updates of these pointers and code readability I have
seen. I am open to better solutions...

> 6) hundreds of lines of new code playing with dynamic modifying
>
> For what you ask?
>
> For 1 (one) cacheline saving in schedule() which
> unlike getpid/getppid is non-trivial function, so all rosy numbers about
> speed savings aren't really applicable. And nobody measured how big it is
> on macro benchmark.
>
> So, this stuff should be put into CoolHacks case and put aside.
> It's just not worth it!

I think you are missing the whole point there. The scheduler profiling
patch is only an example of how immediate values can be used. Timer stats
would fit perfectly too, as would blktrace... actually, any feature that
is very useful to have in a distro, but where the choice must currently be
made at compile time, can now be temporarily activated at runtime when
needed.

"unlike getpid/getppid is non-trivial function": I am instrumenting a
*lot* of kernel paths with LTTng, including do_syscall_entry and
do_syscall_exit, which happen to be executed upon *all* system calls when
tracing is active. I used getppid (tracing inactive, but do_syscall_trace
forced) to exemplify the performance difference on a simple system call,
but you must note that it applies to _every_ system call in the kernel.
I've seen people complain when the impact of a dormant tracer on the
select() system call is noticeable _at all_, so, yes, a 3.5% impact on
getppid is meaningful.
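To make the dormant-hook case concrete, here is roughly what such a site
would look like with immediate values. This is purely illustrative: the
identifiers below are made up for the example and are not taken from the
patch.

	/*
	 * Purely illustrative sketch -- the identifiers are made up, not
	 * from the patch.  A dormant tracing hook guarded by an immediate
	 * value: when tracing is off, the test reads a constant embedded
	 * in the instruction stream instead of pulling the variable's
	 * cache line on every system call.
	 */
	static immediate_int_t syscall_tracing_active;	/* read-mostly flag */

	static inline void trace_syscall_entry(long syscall_nr)
	{
		immediate_if (syscall_tracing_active) {
			/* unlikely slow path: record the event */
		}
	}

	/*
	 * Enabling the tracer at runtime goes through immediate_set(),
	 * which updates every site referring to syscall_tracing_active:
	 *
	 *	immediate_set(syscall_tracing_active, 1);
	 */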
> A side note, folks at KS can discuss barrier for merging speed improvements.
> My feelings are around 1%.

Should they also discuss what performance deterioration is acceptable for
new infrastructure such as a tracer? In the end, they may decide that "it
can be compiled away anyway", but distros will not ship kernels with
features that slow down the kernel, and people will still complain that
Linux does not ship with a tracing alternative comparable to DTrace. The
question that arises when selecting a new feature is not "what is the
average performance improvement of this", but rather "what is the
worst-case performance hit".

This is exactly what I address.

Mathieu

-- 
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68