Author: cromwellian Date: Mon Jun 22 14:50:42 2009 New Revision: 5600 Modified: wiki/AdvancedCompilerOptimizations.wiki
Log: Edited wiki page through web user interface. Cleaned up formatting. Fixed spelling/grammar in some places. Added useful Wiki-linked for compiler theory stuff. Added section on Semantic Inlining. Modified: wiki/AdvancedCompilerOptimizations.wiki ============================================================================== --- wiki/AdvancedCompilerOptimizations.wiki (original) +++ wiki/AdvancedCompilerOptimizations.wiki Mon Jun 22 14:50:42 2009 @@ -35,8 +35,8 @@ = Intra-procedural optimization = -GWT already does dead code elimination, constant folding, and copy propagation, but it does so without -regard to control flow and liveness information, so for example, the following fragment: +GWT already does [http://en.wikipedia.org/wiki/Dead_code_elimination dead code elimination], [http://en.wikipedia.org/wiki/Constant_folding constant folding], and [http://en.wikipedia.org/wiki/Copy_propagation copy propagation], but it does so without +regard to [http://en.wikipedia.org/wiki/Control_flow_graph control flow] and [http://en.wikipedia.org/wiki/Liveness_analysis liveness information], so for example, the following fragment: {{{ int x; @@ -45,9 +45,9 @@ Window.alert(""+x); }}} -will not detect that the assignment `x = 2` is dead code. A block level optimizer with dataflow information -at hand will be able to spot more optimization opportunities. A classic example is heavy use of the builder -pattern: +will not detect that the assignment `x = 2` is dead code. A block level optimizer with [http://en.wikipedia.org/wiki/Data_flow_analysis dataflow information] +at hand will be able to spot more optimization opportunities. A classic example is heavy use of the [http://en.wikipedia.org/wiki/Builder_pattern builder +pattern] with method chaining: {{{ Builder b = new Builder(); @@ -89,24 +89,17 @@ b.C = 30; }}} -Performing these optimizations may be easier to do if statement blocks are flattened into a form of three-address code, so that -liveness ranges are easier to compute. +Performing these optimizations may be easier to do if statement blocks are flattened into a form of [http://en.wikipedia.org/wiki/Three_address_code three-address code], so that liveness ranges are easier to compute. == Common Subexpression Elimination == -GWT currently does not perform CSE. It is unclear how much of a difference it will make for well written programs, but some papers -make claims for non-trivial benefits in scientific computing. However, in large code bases, duplications probably do arise, and -there might be cases where repeating an expression may lead to better readability. There may also be unavoidable common subexpressions -that occur as a by-product of other optimizations (like loop induction). +GWT currently does not perform [http://en.wikipedia.org/wiki/Common_subexpression_elimination CSE]. It is unclear how much of a difference it will make for well written programs, but some papers make claims for non-trivial benefits in scientific computing. However, in large code bases, duplications probably do arise, and there might be cases where repeating an expression may lead to better readability. There may also be unavoidable common subexpressions that occur as a by-product of other optimizations (like loop induction). -Probably the most well known way of doing this is with the Value Numbering technique. +Probably the most well known way of doing this is with the [http://en.wikipedia.org/wiki/Global_value_numbering Value Numbering technique]. == Local Variable Allocation == -While Javascript is not machine code and doesn't have registers, the number of live temporaries have an effect on code size, on the gzip/deflate -compression dictionary, on the obfuscator, and perhaps even the Javascript VM. The intention here is to treat Javascript locals as if they -were machine registers (on an infinite register machine) and use traditional register allocation techniques to minimize the number of locals -used in each scope. As an example: +While Javascript is not machine code and doesn't have registers, the number of live temporaries have an effect on code size, on the gzip/deflate compression dictionary, on the obfuscator, and perhaps even the Javascript VM. The intention here is to treat Javascript locals as if they were machine registers (on an infinite register machine) and use traditional [http://en.wikipedia.org/wiki/Register_allocation register allocation] techniques to minimize the number of locals used in each scope. As an example: {{{ String s = "Hello"; @@ -120,17 +113,15 @@ {{{ String s = "Hello"; Window.alert(s); -s2 = "World"; +s = "World"; Window.alert(s2); }}} -Moreover, the need to introduce temporaries to linearize the AST blocks and optimize examples like the Builder pattern, will increase code size -unless there is an extra pass to remove unneeded temporaries. +Moreover, the need to introduce temporaries to linearize the AST blocks and optimize examples like the Builder pattern, will increase code size unless there is an extra pass to remove unneeded temporaries. == Loop Hoisting / Loop Induction == -Loop Hoisting would find loop-invariant expressions and move them out of the body of the loop. Loop Induction finds variables which increase -acording to a linear function of the loop index, and remove the loop index from the equation. +[http://en.wikipedia.org/wiki/Loop-invariant_code_motion Loop Hoisting] would find loop-invariant expressions and move them out of the body of the loop. Loop Induction finds variables which increase according to a linear function of the loop index, and remove the loop index from the equation. Hoisting example: {{{ @@ -170,22 +161,16 @@ }}} This may or may not be faster on JS VMs. It does reduce the number of referenced variables in the -expression for j and changes a multiplication into an addition. +expression for j and changes a multiplication into an addition. There are a number of other [http://en.wikipedia.org/wiki/Loop_optimization loop optimizations] that could be candidates, such as loop fusion. = Prerequisites = -Many compilers have been traditionally written to turn an AST into a flat intermediate representation like three address code, for -example the GNU C compiler. Recently however, GCC introduced a new intermediate representation based on preserving high level -AST structure, called GENERIC/GIMPLE. [http://gcc.fyxm.net/summit/2003/GENERIC%20and%20GIMPLE.pdf GENERIC and GIMPLE: A new tree -representation for entire functions] Since GWT's current optimizations are based strictly on ASTs and not a low-level IR, -it makes sense to follow in GCC's footsteps at this point and avoid a switch to a low level IR which would require a large -re-engineering effort in the compiler. +Many compilers have been traditionally written to turn an AST into a flat intermediate representation like three address code, for example the GNU C compiler. Recently however, GCC introduced a new intermediate representation based on preserving high level AST structure, called GENERIC/GIMPLE. [http://gcc.fyxm.net/summit/2003/GENERIC%20and%20GIMPLE.pdf GENERIC and GIMPLE A new tree +representation for entire functions] Since GWT's current optimizations are based strictly on ASTs and not a low-level IR, it makes sense to follow in GCC's footsteps at this point and avoid a switch to a low level IR which would require a large re-engineering effort in the compiler. == Linearize AST == In order to perform many of these optimizations, a control flow graph is usually needed, that is, you need -to group a sequence of statements into basic blocks, along with connectivity information, but this is difficult to do -with the current AST structure, where a node within a given block, may in fact, represent several basic blocks, with -no easy way to overlay a control flow structure. +to group a sequence of statements into [http://en.wikipedia.org/wiki/Basic_block basic blocks], along with connectivity information, but this is difficult to do with the current AST structure, where a node within a given block, may in fact, represent several basic blocks, with no easy way to overlay a control flow structure. Consider: {{{ @@ -197,8 +182,7 @@ } }}} -The JWhileStatement will be the root node, but in fact, the first thing to be executed is 'a+b', and due to short-circuiting, control -flow within the conditional is actually two blocks. This loop can be visualized as: +The JWhileStatement will be the root node, but in fact, the first thing to be executed is 'a+b', and due to short-circuiting, control flow within the conditional is actually two blocks. This loop can be visualized as: {{{ // BLOCK 1 @@ -212,10 +196,10 @@ j++; // BLOCK4 tcond3 = b > c; if(tcond3) - a = c; + a = c; // BLOCK5 else - a = b; - b = (b * 3; + a = b; // BLOCK6 + b = (b * 3; // BLOCK7 b = b + 1; b = b % 2; c = c/2; @@ -226,38 +210,27 @@ GCC calls this _GIMPLE_ and refers to the process as _gimplifying_. -Linearizing the AST could have detrimental effects on output code as well as other optimizations, since some of the original source -structure will be obscured. To guard against a potentially bad payoff, the linearizing can be done in two phases. +Linearizing the AST could have detrimental effects on output code as well as other optimizations, since some of the original source structure will be obscured. To guard against a potentially bad payoff, the linearizing can be done in two phases. === Phase 1 === -No control flow graph. Only flatten JExpressions by introducing temporaries. Do not flatten loops, conditionals, or other -flow control structures. This will permit many of the intra-block optimizations to be made, but won't enable useful -cross-block dataflow. It chould have some positive benefit for Builder patterns which usually consist of chained -methods with no conditionals, which if inlined fully, would already be a basic block. +No control flow graph. Only flatten JExpressions by introducing temporaries. Do not flatten loops, conditionals, or other flow control structures. This will permit many of the intra-block optimizations to be made, but won't enable useful cross-block dataflow. It chould have some positive benefit for Builder patterns which usually consist of chained methods with no conditionals, which if inlined fully, would already be a basic block. === Phase 2 === -Flatten for, while, do/while, if/else, switch/case, and other high level conditional statements. This may entail rewriting a flow control -construct as a different high level construct. If possible, provide an unflatten routine which can recover the original high level construct (i.e. for vs while) +Flatten for, while, do/while, if/else, switch/case, and other high level conditional statements. This may entail rewriting a flow control construct as a different high level construct. If possible, provide an unflatten routine which can recover the original high level construct (i.e. for vs while) == Control Flow Graph == -After linearizing the AST, it should be straight forward to build traditional flow control graph of basic blocks. The representation is another matter. One -way of doing it is to build separate CFGNodes, composed of CFGBlocks which contain JNodes. This is yet another, parallel datastructure, and may vastly -increase memory usage, although the CFGs are only needed per-procedure and can be thrown away. +After linearizing the AST, it should be straight forward to build traditional flow control graph of basic blocks. The representation is another matter. One way of doing it is to build separate CFGNodes, composed of CFGBlocks which contain JNodes. This is yet another, parallel datastructure, and may vastly increase memory usage, although the CFGs are only needed per-procedure and can be thrown away. -Another possibility is to introduce a variant of JBlock, a JCFGBlock which exists in the Java AST, and can be used to group existing AST nodes, so that JBlocks -are split up and rewritten as JCFGBlocks. +Another possibility is to introduce a variant of JBlock, a JCFGBlock which exists in the Java AST, and can be used to group existing AST nodes, so that JBlocks are split up and rewritten as JCFGBlocks. -JCFGBlocks would contain additional methods, fields, and interfaces to track in/out connections, dataflow information, as well as providing visitor methods -for walking over a CFG. +JCFGBlocks would contain additional methods, fields, and interfaces to track in/out connections, dataflow information, as well as providing visitor methods for walking over a CFG. == Data Flow Analysis == -Once the tree is linearized into basic blocks of roughly three-address-code comparable AST nodes, one of the first pieces of information that can be -gathered is the lifetime of local variables. That is, the range of statements within a block in which the variable has it's value defined, -up to the last time it is referenced. This analysis is made more difficult if variables are not immutable, for example: +Once the tree is linearized into basic blocks of roughly three-address-code comparable AST nodes, one of the first pieces of information that can be gathered is the lifetime of local variables. That is, the range of statements within a block in which the variable has it's value defined, up to the last time it is referenced. This analysis is made more difficult if variables are not immutable, for example: {{{ int x = 20 @@ -267,13 +240,11 @@ Window.alert(y * 2 + x); }}} -In this example, y's live range is between line 2 and line 5. However, x has two live ranges. The first occurs between line 1 -and line 3. The second between line 4 and line 5, because x is redefined on line 4. +In this example, y's live range is between line 2 and line 5. However, x has two live ranges. The first occurs between line 1 and line 3. The second between line 4 and line 5, because x is redefined on line 4. === Static Single Assigment (SSA) Form === -To combat and make liveness analysis easier, we treat subsequent definitions of a variable as a new variable. Thus, we can -rewrite the above example: +To combat and make liveness analysis easier, we treat subsequent definitions of a variable as a new variable. Thus, we can rewrite the above example: {{{ int x = 20 int y = x * 3 + 10; @@ -282,8 +253,7 @@ Window.alert(y * 2 + x2); }}} -Now it is clear that a variable's live range begins at its definition, and ends on the last line that references it. A wrinkle -occurs if there is a conditional: +Now it is clear that a variable's live range begins at its definition, and ends on the last line that references it. A wrinkle occurs if there is a conditional: {{{ x = 5; @@ -308,8 +278,7 @@ Window.alert(x2 if the then clause, x3 if the else clause); }}} -We have no way of representing the choice of one variable or the other in the AST node depending on which path was taken. We can invent -one, let's call it JPhiNode and put it into the AST, which contains a list of all versions of x reaching a certain point; +We have no way of representing the choice of one variable or the other in the AST node depending on which path was taken. We can invent one, let's call it JPhiNode and put it into the AST, which contains a list of all versions of x reaching a certain point; {{{ x = 5; if(expr) { @@ -321,30 +290,21 @@ Window.alert(phi(x2, x3)); }}} -This is a place holder for the compiler. The compiler can't know which path was taken until runtime, and phi is not an executable -statement, but later on when optimizations are over, we can convert out of SSA form, which means translating those phi functions -back to real code. +This is a place holder for the compiler. The compiler can't know which path was taken until runtime, and phi is not an executable statement, but later on when optimizations are over, we can convert out of SSA form, which means translating those phi functions back to real code. For more on SSA and phi-functions see, [http://en.wikipedia.org/wiki/Static_single_assignment_form#Converting_to_SSA Converting to SSA] == More Data Flow Analysis == === Register Allocation / Interference Graph === -With live-range information, we can prune unused variables better, we can also perform 'register allocation' by figuring out -which live-ranges overlap, and which do not, and using that information to minimize the number of local variables used by a -function. As mentioned earlier, this should help out somewhat with the obfuscator and gzip. +With live-range information, we can prune unused variables better, we can also perform 'register allocation' by figuring out which live-ranges overlap, and which do not, and using that information to minimize the number of local variables used by a function. As mentioned earlier, this should help out somewhat with the obfuscator and gzip. See [http://en.wikipedia.org/wiki/Register_allocation#Recent_developments recent developments in register allocation] for some other potential techniques. === Dominators === -Given a block B_j, a block B_i *dominates* B_j if all paths from the start of the program reaching B_j have to pass through -B_i. If B_i dominates B_j, and no block B_x exists which dominates B_j, but is dominated B_i, we call B_i an *immediate dominator*. +Given a block B_j, a block B_i [http://en.wikipedia.org/wiki/Dominator_(node) *dominates*] B_j if all paths from the start of the program reaching B_j have to pass through B_i. If B_i dominates B_j, and no block B_x exists which dominates B_j, but is dominated B_i, we call B_i an *immediate dominator*. -Dominance information enables a number of optimizations. For GWT in particular, dominance information can provide better pruning -for clinits, as well as type-tightening. In particular, if clinitA occurs in a dominator of B_j, it may be pruned from B_j since -it is guaranteed that to reach B_j, clintA must have already be executed earlier. - -In the case of type-tightening, types may be tracked between blocks. If a variable in block B_j is live-in from some block B_i which -dominates B_j (that is, it was defined/assigned in block B_i), and the type of the variable is known to be type T, then it may be -tightened to T. Currently, GWT only does this analysis at the global level and does not track type-flow through multiple stack frames. +Dominance information enables a number of optimizations. For GWT in particular, dominance information can provide better pruning for clinits, as well as type-tightening. In particular, if clinitA occurs in a dominator of B_j, it may be pruned from B_j since it is guaranteed that to reach B_j, clintA must have already be executed earlier. + +In the case of type-tightening, types may be tracked between blocks. If a variable in block B_j is live-in from some block B_i which dominates B_j (that is, it was defined/assigned in block B_i), and the type of the variable is known to be type T, then it may be tightened to T. Currently, GWT only does this analysis at the global level and does not track type-flow through multiple stack frames. That said, these analyses require a whole-program CFG, rather than just a procedural one. @@ -352,8 +312,7 @@ == Object Inlining == -Object Inlining is a technique for eliminating the allocation of an object, by inlining all of its fields into another object, or as -local scalar variables on the stack. Consider the following code: +Object Inlining is a technique for eliminating the allocation of an object, by inlining all of its fields into another object, or as local scalar variables on the stack. Consider the following code: {{{ public class Point { @@ -382,8 +341,7 @@ } }}} -This eliminates the existence of the Point class. If Point had any methods, they could be inlined as specializations (cloned for each instance), and -hopefully pruned after further inlining in expressions, e.g. +This eliminates the existence of the Point class. If Point had any methods, they could be inlined as specializations (cloned for each instance), and hopefully pruned after further inlining in expressions, e.g. {{{ @@ -398,15 +356,14 @@ Point bottomRight; public int area() { - return (bottomRight.getX() - topLeft.getX()) * (bottomRight.getY() - topLeft.getX()); + return (bottomRight.getX() - topLeft.getX()) * (bottomRight.getY() - topLeft.getY()); } } }}} -Consider inlining those methods the same as the fields, by prefixing them. Polymorphism would be problematic to handle, -requiring inlining the union of potential type's fields as a sort of discriminated union, and probably not worth the effort. -In general, it's probably best to avoid object inlining except where the concrete type is known and all methods are effectively -static. This roughly isomorphic to Overlay Types. +Consider inlining those methods the same as the fields, by prefixing them. Polymorphism would be problematic to handle, requiring inlining the union of potential type's fields as a sort of discriminated union, and probably not worth the effort. + +In general, it's probably best to avoid object inlining except where the concrete type is known and all methods are effectively static. This roughly isomorphic to Overlay Types. A complication arises if `Point` escapes scope, either leaks out of the class through a method, or leaks from local stack scope. @@ -422,9 +379,7 @@ } }}} -We can handle this in one of two ways: construct a `new Point(topLeft_x, topLeft_y)` and pass that value. Or, if this situation is -detected, back out of the inlining optimization. The latter is probably safer and would avoid pathological performance problems -(such as someone calling getTopLeft() in a loop) This leads to the following rule: +We can handle this in one of two ways: construct a `new Point(topLeft_x, topLeft_y)` and pass that value. Or, if this situation is detected, back out of the inlining optimization. The latter is probably safer and would avoid pathological performance problems (such as someone calling getTopLeft() in a loop) This leads to the following rule: _An object A may not be inlined into an Object B if any method exists on B which passes a reference to A or returns a reference to A_ @@ -445,13 +400,30 @@ }}} Which provides an exception to the above rule: -_An object A may not be inlined into an Object B if any method exists on B which passes a reference to A or returns a reference to A_, -except in cases where the reference may be inlined into the caller/callee. +_An object A may not be inlined into an Object B if any method exists on B which passes a reference to A or returns a reference to A_, except in cases where the reference may be inlined into the caller/callee. + +A reference could be inlined into the callee in cases where the object is inlined, and no other callers invoke the method with non-inlined versions. Consider: + +{{{ +public class PointUtil { + public static boolean equals(Point p1, Point p2) { + return p1.x == p2.x && p1.y == p2.y; + } +} +}}} + +If `equals` is never invoked from any callsite with a non-inlined version, then the method may be rewritten as: +{{{ +public class PointUtil { + public static boolean equals(int p1x, int p1y, int p2x, int p2y) { + return p1x == p2x && p1y == p2y; + } +} +}}} == Type Cloning == (via Bruce) -For every unique (lexical) occurrence of an allocation of type T, clone T into T' as if it were an independent type, but arrange that T' is indistinguishable from T w.r.t. casts, insta -nceof, and getClass(). In other words, you could never determine at runtime that the object wasn't actually a genuine T. Now, optimize as normal. To see how this could play out: +For every unique (lexical) occurrence of an allocation of type T, clone T into T' as if it were an independent type, but arrange that T' is indistinguishable from T w.r.t. casts, instanceof, and getClass(). In other words, you could never determine at runtime that the object wasn't actually a genuine T. Now, optimize as normal. To see how this could play out: {{{ class BasicList<E> { @@ -508,10 +480,63 @@ } } }}} -In the above, `b` is of type `BasicList'` (a clone of `BasicList` that can be independently optimized). Thus, you can see that `BasicList'` does not have the `add()` method called, whi -ch means that, for this particular use of the type, the `elems` field will always be `null`. Consequently, `size()` can be statically known to return `0`. Similarly, the iterator's `ha -sNext()` method can be statically shown to always return false in the for/each loop. - -The key point is that the usage of a type in one context would not affect how the type itself could be optimized in other contexts. As it stands today, we can't do the above optimizati -on if anywhere in the program someone calls `BasicList#add()`. A great practical example is Widgets firing events. You want widgets to be *able* to fire events *if* someone actually wa -nts to listen to them. In the cases where you can see that there are no listeners to a widget, you'd really like the entirely of the even-firing infrastructur to completely disappear. +In the above, `b` is of type `BasicList'` (a clone of `BasicList` that can be independently optimized). Thus, you can see that `BasicList'` does not have the `add()` method called, which means that, for this particular use of the type, the `elems` field will always be `null`. Consequently, `size()` can be statically known to return `0`. Similarly, the iterator's `hasNext()` method can be statically shown to always return false in the for/each loop. + +The key point is that the usage of a type in one context would not affect how the type itself could be optimized in other contexts. As it stands today, we can't do the above optimization if anywhere in the program someone calls `BasicList#add()`. A great practical example is Widgets firing events. You want widgets to be *able* to fire events *if* someone actually wants to listen to them. In the cases where you can see that there are no listeners to a widget, you'd really like the entirely of the even-firing infrastructur to completely disappear. + +== Semantic Techniques == + +Let us call a method semantic, if it is recognized by the compiler by name, and the compiler has a heuristic for evaluating the same function, discarding the original implementation. Semantic techniques are used by a number of compilers to replace operations with intrinsically faster ones. Consider the Java library function `Math.sin()`. While the library could theoretically have an implementation of sin() using table lookup or power series expansion, in some architectures, there is a native CPU instruction for computing `sin`, and therefore, the JIT compiler can in many cases, detect the call to Math.sin(), and replace it with an inlined instruction. Let us call this technique _semantic inlining_. + +Given the prior technique of _Type Cloning_, it may be possible to speed up JRE collections significantly. Let's see how. Consider the following code: + +{{{ +HashMap<String, String> map = new HashMap<String, String>(); +map.put("apples", "Ray"); +map.put("grapes", "Bruce"); +map.put("oranges", "Scott"); +for(String key : map.keySet()) { + Window.alert(map.get(key) + " likes " + key); +} +}}} + +What we would like to see as output is: + +{{{ +map = {}; +map['apples']="Ray"; +map['grapes']="Bruce"; +map['oranges']="Scott"; +for(var key in map) { + $wnd.alert(map[key] + ' likes ' + key); +} +}}} + +If we recognize that the cloned type T is HashMap<String, String>, and that the method entrySet(), values(), remove(), and others are never invoked, might it be possible to replace the implementation of HashMap<String,String> for T with a simpler, pre=optimized one based on the needs of a simple string dictionary lookup? + +Furthermore, the pattern `for(key : map.keySet())` is extremely common and has a corresponding efficient Javascript representation. Perhaps this can be recognized too and replaced with a JSNI method call/JSNI fragment. + +This applies equally well to `ArrayList`: +{{{ +ArrayList list = new ArrayList(); +list.add("foo"); +list.add("bar"); +list.add("baz"); +for(String word : list) { + Window.alert(word); +} +}}} + +which could be written in Javascript as: + +{{{ +list = []; +list[list.length] = "foo"; +list[list.length] = "bar"; +list[list.length] = "baz"; +for(var i=0; i<list.length; i++) { + $wnd.alert(list[i]); +} +}}} + +Since many of the methods of ArrayList are unused in this circumstance, and java.util.Iterator is never used directly, a semantically replaced/inline version could avoid JRE collections altogether. \ No newline at end of file --~--~---------~--~----~------------~-------~--~----~ http://groups.google.com/group/Google-Web-Toolkit-Contributors -~----------~----~----~----~------~----~------~--~---