Hey Graham,

Sorry for the delay, I meant to send this over the weekend.

On Thu, Jun 25, 2009 at 2:50 AM, Paul Biggar<paul.big...@gmail.com> wrote:
>
> When I say model, I mean that we "abstractly" "interpret" the program.
> Its like we're interpreting it, but we don't have the actual values,
> so we model all possible values at a certain point. We do this using 3
> combined analyses: constant propagation, alias analysis, and type
> inference. We also model definition of constants and the callgraph. We
> don't currently (but plan to in the future) model integer- and
> string-value-ranges, included files, or "truth analysis" (though that
> would only take about an hour, I think).

(If you're already familiar with how compiler optimization works, the
first few paragraphs won't be anything new to you. I'll gloss over the
simple compiler terms, please email me if googling hasnt worked for
you; I'll keep explaining the advanced concepts).

So the abstract interpretation then. We analyse a representation
called the "Medium-level Intermediate Representation" (MIR), which is
pretty close to PHP bytecode. We start by invoking the first method,
__MAIN__, which represents the global scope.


Upon entering a method, we convert it into a control-flow-graph (CFG),
which has one "basic block" per statement. We add the entry block to a
"worklist", which is processed roughly in FIFO order. After we process
a statement for the first time, we add its successors to the worklist.
In the case of loops, we may process the same statement again (or a
few times) -- in which case we only add the successor if the results
of the analysis have changed (aka we iterate until we reach a
"fixed-point").

This is all just like normal dataflow analysis. One twist we have is
that we use "conditional constant propagation". This means that if we
have something like:

if ($myvar) $x = 5 else $x = 6;

and we know that $myvar is 7, then we don't process the $x = 6, since
it can't occur.


At each "program point" (ie after each block) we store the complete
state of the analysis at this point. What exactly is stored depends on
the analysis. For constant-propagation, we store a "lattice" of
constants. That means that we know each variable is one of:
- TOP: the value is undefined
- BOTTOM: the variable may have multiple values
- or we know the exact constant that variable represents

With PHP, there is a slight difference from "normal" lattice based
analysis, in that TOP can represent NULL, since variables are
automatically initialized to NULL. When we find another value for $x,
we merge it with the current value. While (TOP + some value V => V),
(NULL + non-NULL => BOTTOM).

For type-inference, we use a set of type names for each variable. The
types are int, bool, unset, real, string, resource, array, and any
concrete classname. We don't have a BOTTOM (most conservative result),
because not knowing one type is the end of the analysis. In theory,
this means that in some unusual cases (like "new $classname.$num") the
analysis will only terminate by running of out memory. Oh well.

For constants (as in "define (MYCONST, 5);"), we store the value of
each constant so far in the program. We use the lattice approach from
constant-propagation, but unknown values only come up if you define
the same constant in different branches, which isn't too common I
think. We currently don't model these well enough to be able to
resolve a call to is_defined, but this wouldn't be too hard.

Finally, the really big and important analysis is alias analysis. I'll
go into some detail in a moment. We store at each program point a
complete points-to graph of the alias results, and use copy-on-write
to avoid a massive explosion in space and execution time (not very
successfully).


Alias analysis is the really important analysis. I mentioned before
that not knowing a single type will ruin the analysis. When I said
above that we stored a constant lattice and a set of types for each
variable, that wasn't exactly true. We actually store them for each
object, field, array, index, symbol table and variable. The alias
analysis allows us to tell the interaction between all of these.

You might notice that these three concepts are mostly the same: object
is-to field as array is-to index as symbol-table is-to variable. In
the Zend engine, these are all implemented using hashtables. In our
analysis, these are also all modelled the same way.

Our points-to graph is just a graph with 3 kinds of nodes:
   - "storage nodes" represent objects, arrays and symbol tables,
  - "index nodes" represent fields, variables and array indices,
  - "value nodes" represent scalar values of index nodes.

There are 3 kinds of edges
 - "field edges" go from a storage node to an index node: for S -> I,
the object S may have a field named I.
 - "reference edges" go from index nodes to index nodes: for I1 -> I2,
the field I1 references the field I2.
 - "value edges" go from index nodes to storage or value nodes: for I
-> S, one of I's values is S.

As with constants, we can't always resolve if a field must exist. If a
field might be NULL, we've modelled that the same as if its
uninitialized. Again, this isn't impossible, just not a priority.


To give a more concrete example, consider (we'll assume the function
is __MAIN__):

- $x = 5; -- create an index_node "x", pointed to by the local symbol
table "__MAIN__", and pointing to a value node whose value is "5,int"

- $a = $b; -- do a deep copy from "__MAIN__" -> "b" to "__MAIN__" ->
"a". This means take all the storage nodes and do a deep clone
(keeping track of cycles). We only do deep clones of arrays. Objects
are copied by reference, so we just add an edge to the existing
storage node representing that object. For scalars, we merge the
values in the type inference and constant-propagation, but the
points-to graphs are unaffected.

- $y =& $obj->fi; -- Find all the index nodes represented by $obj->fi.
First, find $obj, which is the index_node "obj" in the local symbol
table "__MAIN__". Then, for each of the storage nodes $obj points to,
find its index node "fi". If we're lucky, there will only be one such
node, in which case we create a DEFINITE reference edge between it and
"y". Otherwise, we create one POSSIBLE reference edge between the
"fi"s and "y". We then copy all the values to $y by adding an edge
from y to all the storage nodes of $obj->fi's. We don't add an edge to
$obj->fi's value node, instead, we merge its value into $y's value
node.


Actually, its a little bit more complicated than this, because of references. So

$x = 5

actually checks if $x references any of index_nodes, whether it can
overwrite them with 5 or must make the result BOTTOM, etc.

I've put up an example of a points-to graph at

https://www.cs.tcd.ie/~pbiggar/whirl_41.png

which represents:

http://code.google.com/p/phc/source/browse/branches/dataflow/test/subjects/benchmarks/php-benchmarks/benchcli/tests/phpWhirl/classes/WhirlMemory.php#45.

Grey boxes are symbol tables (function local ones have "=" in their
name), blue boxes are values (B indicated BOTTOM, {...} is the set of
values), circles are index nodes. There are no reference edges in this
graph, they're all points-to or value edges. This is only the very
start of that program.



OK, that's an executive overview of the analysis algorithm. There are
quite a few details, as you might imagine, so I think the easiest
thing is to pick on something and ask about it. Topics of interest
might be:

implementation details
context-sensitivity
flow-sensitivity
loops
recursion
eval
include()
"fake" nodes
method/function calls
modelling builtin functions
inheritance
SSA
def-use information
{string,int}-range analyses
optimizations

Feel free to literally go through these one at a time, but I'd rather
you directed the course rather than me (that'll make it harder for me
to gloss over something I'm unsure about).

Paul


-- 
Paul Biggar
paul.big...@gmail.com

-- 
PHP Internals - PHP Runtime Development Mailing List
To unsubscribe, visit: http://www.php.net/unsub.php

Reply via email to