Hey Graham, Sorry for the delay, I meant to send this over the weekend.
On Thu, Jun 25, 2009 at 2:50 AM, Paul Biggar<paul.big...@gmail.com> wrote: > > When I say model, I mean that we "abstractly" "interpret" the program. > Its like we're interpreting it, but we don't have the actual values, > so we model all possible values at a certain point. We do this using 3 > combined analyses: constant propagation, alias analysis, and type > inference. We also model definition of constants and the callgraph. We > don't currently (but plan to in the future) model integer- and > string-value-ranges, included files, or "truth analysis" (though that > would only take about an hour, I think). (If you're already familiar with how compiler optimization works, the first few paragraphs won't be anything new to you. I'll gloss over the simple compiler terms, please email me if googling hasnt worked for you; I'll keep explaining the advanced concepts). So the abstract interpretation then. We analyse a representation called the "Medium-level Intermediate Representation" (MIR), which is pretty close to PHP bytecode. We start by invoking the first method, __MAIN__, which represents the global scope. Upon entering a method, we convert it into a control-flow-graph (CFG), which has one "basic block" per statement. We add the entry block to a "worklist", which is processed roughly in FIFO order. After we process a statement for the first time, we add its successors to the worklist. In the case of loops, we may process the same statement again (or a few times) -- in which case we only add the successor if the results of the analysis have changed (aka we iterate until we reach a "fixed-point"). This is all just like normal dataflow analysis. One twist we have is that we use "conditional constant propagation". This means that if we have something like: if ($myvar) $x = 5 else $x = 6; and we know that $myvar is 7, then we don't process the $x = 6, since it can't occur. At each "program point" (ie after each block) we store the complete state of the analysis at this point. What exactly is stored depends on the analysis. For constant-propagation, we store a "lattice" of constants. That means that we know each variable is one of: - TOP: the value is undefined - BOTTOM: the variable may have multiple values - or we know the exact constant that variable represents With PHP, there is a slight difference from "normal" lattice based analysis, in that TOP can represent NULL, since variables are automatically initialized to NULL. When we find another value for $x, we merge it with the current value. While (TOP + some value V => V), (NULL + non-NULL => BOTTOM). For type-inference, we use a set of type names for each variable. The types are int, bool, unset, real, string, resource, array, and any concrete classname. We don't have a BOTTOM (most conservative result), because not knowing one type is the end of the analysis. In theory, this means that in some unusual cases (like "new $classname.$num") the analysis will only terminate by running of out memory. Oh well. For constants (as in "define (MYCONST, 5);"), we store the value of each constant so far in the program. We use the lattice approach from constant-propagation, but unknown values only come up if you define the same constant in different branches, which isn't too common I think. We currently don't model these well enough to be able to resolve a call to is_defined, but this wouldn't be too hard. Finally, the really big and important analysis is alias analysis. I'll go into some detail in a moment. We store at each program point a complete points-to graph of the alias results, and use copy-on-write to avoid a massive explosion in space and execution time (not very successfully). Alias analysis is the really important analysis. I mentioned before that not knowing a single type will ruin the analysis. When I said above that we stored a constant lattice and a set of types for each variable, that wasn't exactly true. We actually store them for each object, field, array, index, symbol table and variable. The alias analysis allows us to tell the interaction between all of these. You might notice that these three concepts are mostly the same: object is-to field as array is-to index as symbol-table is-to variable. In the Zend engine, these are all implemented using hashtables. In our analysis, these are also all modelled the same way. Our points-to graph is just a graph with 3 kinds of nodes: - "storage nodes" represent objects, arrays and symbol tables, - "index nodes" represent fields, variables and array indices, - "value nodes" represent scalar values of index nodes. There are 3 kinds of edges - "field edges" go from a storage node to an index node: for S -> I, the object S may have a field named I. - "reference edges" go from index nodes to index nodes: for I1 -> I2, the field I1 references the field I2. - "value edges" go from index nodes to storage or value nodes: for I -> S, one of I's values is S. As with constants, we can't always resolve if a field must exist. If a field might be NULL, we've modelled that the same as if its uninitialized. Again, this isn't impossible, just not a priority. To give a more concrete example, consider (we'll assume the function is __MAIN__): - $x = 5; -- create an index_node "x", pointed to by the local symbol table "__MAIN__", and pointing to a value node whose value is "5,int" - $a = $b; -- do a deep copy from "__MAIN__" -> "b" to "__MAIN__" -> "a". This means take all the storage nodes and do a deep clone (keeping track of cycles). We only do deep clones of arrays. Objects are copied by reference, so we just add an edge to the existing storage node representing that object. For scalars, we merge the values in the type inference and constant-propagation, but the points-to graphs are unaffected. - $y =& $obj->fi; -- Find all the index nodes represented by $obj->fi. First, find $obj, which is the index_node "obj" in the local symbol table "__MAIN__". Then, for each of the storage nodes $obj points to, find its index node "fi". If we're lucky, there will only be one such node, in which case we create a DEFINITE reference edge between it and "y". Otherwise, we create one POSSIBLE reference edge between the "fi"s and "y". We then copy all the values to $y by adding an edge from y to all the storage nodes of $obj->fi's. We don't add an edge to $obj->fi's value node, instead, we merge its value into $y's value node. Actually, its a little bit more complicated than this, because of references. So $x = 5 actually checks if $x references any of index_nodes, whether it can overwrite them with 5 or must make the result BOTTOM, etc. I've put up an example of a points-to graph at https://www.cs.tcd.ie/~pbiggar/whirl_41.png which represents: http://code.google.com/p/phc/source/browse/branches/dataflow/test/subjects/benchmarks/php-benchmarks/benchcli/tests/phpWhirl/classes/WhirlMemory.php#45. Grey boxes are symbol tables (function local ones have "=" in their name), blue boxes are values (B indicated BOTTOM, {...} is the set of values), circles are index nodes. There are no reference edges in this graph, they're all points-to or value edges. This is only the very start of that program. OK, that's an executive overview of the analysis algorithm. There are quite a few details, as you might imagine, so I think the easiest thing is to pick on something and ask about it. Topics of interest might be: implementation details context-sensitivity flow-sensitivity loops recursion eval include() "fake" nodes method/function calls modelling builtin functions inheritance SSA def-use information {string,int}-range analyses optimizations Feel free to literally go through these one at a time, but I'd rather you directed the course rather than me (that'll make it harder for me to gloss over something I'm unsure about). Paul -- Paul Biggar paul.big...@gmail.com -- PHP Internals - PHP Runtime Development Mailing List To unsubscribe, visit: http://www.php.net/unsub.php