[PHP-DEV] Re: PHC Dataflow

2009-06-30 Thread Paul Biggar
Hi Graham,

Sorry for taking so long getting back to you. I'll reply to both your
mails in one.

On Mon, Jun 29, 2009 at 9:37 PM, Graham Kelly wrote:
> Anyway, I’ve been looking into building a fairly simple def-use system where
> I build a graph for each variable in the function local scope (each CV,
> TMP_VAR, and VAR).  What I’m doing is building a local graph for each

So you're right of course. Simple should be the order of the day. You
can very likely do a lot of things with a simple approach.

 What are the other possible variable types?


> variable in each basic block and then using the CFG information to connect
> these local graphs to form a function-wide def-use graph for each variable.

Does the CFG represent exception information? I think nearly every
statement can throw an exception via set_error_handler. Note also that some
statements do nothing due to errors.



> The def-use system seems like it should allow for some fairly
> straightforward work to be done with dead code elimination, constant propagation,
> general usage information, loop induction identification, etc, etc. I

In theory yes. I would be surprised if you were able to do
loop-invariant code elimination, but you'll probably realize some nice
results with what you named above. I expect you'll find speed-ups from
places where naive bytecode has been generated too. I found a few
places where we created weird constructs early in the compiler that
were very easy to clear up later, with good results.

What are you using loop induction identification for? Converting
for-loops into foreach?



> Where I eventually want to go with this is to try to lock down as many data
> types as possible during optimization. Of course the biggest problem
> pecl/optimizer has with this is not being able to look at the entire
> program’s structure and use that to infer function argument data types. But,
> I also feel that before such analysis could happen I need to have the
> function local data flow analysis down solid anyway and at that point I can
> worry about the unknown types of arguments issue. I also feel that data
> types alone won't really be enough for PHP and tracking possible ranges for
> variables might be useful (this way for one you could better lock down the
> data types by for example being able to determine if a ZEND_ADD opcode will
> ever overflow into a double or if you can safely infer that it will always
> be a long).

This seems out of scope for a bytecode -> bytecode optimizer? How would you use
that information? Adding extra bytecodes? I very much doubt this will
be worth it, especially considering the effort involved in following
types around (don't forget that every function can return NULL, and
many can return False). I see few places where I know a variable is an
int.

However, if you do go down this route, you might try peeling off the
first iteration of a loop. That way the first iteration will have the
weird stuff, if any, and the later iterations will have more pristine
information. There's a paper on this for Python:
http://portal.acm.org/citation.cfm?id=1534550
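
i.e. roughly this transformation ($a, $n and f() are just placeholders):

<?php
function f($x) { echo $x, "\n"; }
$a = array('first', 'second', 'third');
$n = count($a);

// before peeling
for ($i = 0; $i < $n; $i++) { f($a[$i]); }

// after peeling: the first iteration carries the "weird" cases, so the
// remaining iterations see more pristine type/value information
if ($n > 0) {
    f($a[0]);
    for ($i = 1; $i < $n; $i++) { f($a[$i]); }
}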


On Tue, Jun 30, 2009 at 8:51 PM, Graham Kelly wrote:
> Anyway, I guess I’ll elaborate a bit on what I’ve been doing with data flow
> (ish) stuff. Basically I am looking at building a def use graph for each
> variable (as I mentioned in my last email). The problems seem to come with
> loops and with var vars, evals, and includes (I also haven't worked out
> array indexes and object properties yet). The loops have had me considering

I would say that for var-vars and evals and includes, just don't
optimize functions containing them. Evals are bad practice, includes
are probably bad practice within a function, and var-vars are unusual.
Of course, you might be interested if a lot of the code you're
optimizing is automatically generated, in which case:

evals/includes: you no longer know anything about anything (unless you
can see what's included/evaled, and parse it, etc)
var-vars: they can read from any variable, and write to any variable.
So "$$a = 5;" is a may-def of all variables, "print ($$a)" is a
may-use of all variables. The set of all variables includes unknown
variables (say, if $a is "facebook.com"). If you have var-vars and
references in the same bytecode, give up on this function.
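
To spell that out (the variable names are made up):

<?php
// $$a can write to *any* variable, so statically it is a may-def of them all.
$a = 'b';
$$a = 5;        // defines $b
$a = 'count';
$$a = 5;        // defines $count; which one runs is unknowable statically
print $$a;      // likewise a may-use of every variable in scope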



> trying to use SSA because of the niceness it brings and because I think the
> phi functions could simplify some of the junk with loops. For var vars,

SSA gives a really nice framework to work with. My own experience
though is that everything I want to do needs to be done in order to
build SSA form. I gave a talk before on the nastiness of trying to do
SSA in PHP: http://www.prog.uni-saarland.de/ssasem/talks/Paul.Biggar.pdf.
But I got SCCP and DCE to work well on SSA, if I ignored lots of edge
cases.

Basically, there is a small set of hidden operations, like setting
$php_errormsg. Once you spot when those happen, you're probably home
free.
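
For example (this assumes track_errors is on in php.ini):

<?php
// The hidden def: on failure this statement also sets $php_errormsg.
$fp = @fopen('/no/such/file', 'r');
echo $php_errormsg;   // defined by the failed fopen(), not by any visible assignment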


> evals, and includes I was considering inserting a fake def at each one of those.

[PHP-DEV] Re: PHC Dataflow

2009-06-30 Thread Graham Kelly
Hey,

Anyway, I guess I'll elaborate a bit on what I've been doing with data flow
(ish) stuff. Basically I am looking at building a def-use graph for each
variable (as I mentioned in my last email). The problems seem to come with
loops and with var vars, evals, and includes (I also haven't worked out array
indexes and object properties yet).

The loops have had me considering trying to use SSA because of the niceness
it brings and because I think the phi functions could simplify some of the
junk with loops. For var vars, evals, and includes I was considering inserting
a fake def at each one of those. The fake def could be marked specially so
that dead code elimination knows these aren't real defs and can maybe make
some observations about that (or whatever).

As for array/object stuff... I'm not extremely interested in trying to track
this at the moment. For one, I don't think doing so will yield many results
with the limited scope pecl/optimizer has for analysis, plus I have no plans
for using that data at the moment. The only big advantage is it might help
narrow down data types for other vars better.

The other big area I haven't looked at is references. The easiest approach I
see here is to just duplicate the def-use graph for each reference from the
point after the reference was created. This of course could lead to a lot of
extra junk in code with lots of references. Plus references to specific array
indexes etc might turn out to be a big pain.

Heh, sorry for the open-endedness here but... any ideas/comments on this
approach?
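
Roughly what I have in mind for the fake defs mentioned above (the field and
function names are just illustrative):

<?php
class DefNode {
    public $opline;          // index of the defining opcode
    public $var;             // CV/TMP_VAR/VAR id
    public $is_fake = false; // true for defs inserted at eval/include/var-var sites
    public $uses = array();  // oplines that read this def
}

// Insert a fake def so later passes see *something* defining $var here.
function add_fake_def(array &$graph, $var, $opline) {
    $def = new DefNode();
    $def->opline  = $opline;
    $def->var     = $var;
    $def->is_fake = true;    // DCE must never remove or rely on a fake def
    $graph[$var][] = $def;
}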


On 6/29/09 11:04 AM, "Paul Biggar"  wrote:

Hey Graham,

Sorry for the delay, I meant to send this over the weekend.

On Thu, Jun 25, 2009 at 2:50 AM, Paul Biggar wrote:
>
> When I say model, I mean that we "abstractly" "interpret" the program.
> Its like we're interpreting it, but we don't have the actual values,
> so we model all possible values at a certain point. We do this using 3
> combined analyses: constant propagation, alias analysis, and type
> inference. We also model definition of constants and the callgraph. We
> don't currently (but plan to in the future) model integer- and
> string-value-ranges, included files, or "truth analysis" (though that
> would only take about an hour, I think).

(If you're already familiar with how compiler optimization works, the
first few paragraphs won't be anything new to you. I'll gloss over the
simple compiler terms, please email me if googling hasn't worked for
you; I'll keep explaining the advanced concepts).

So the abstract interpretation then. We analyse a representation
called the "Medium-level Intermediate Representation" (MIR), which is
pretty close to PHP bytecode. We start by invoking the first method,
__MAIN__, which represents the global scope.


Upon entering a method, we convert it into a control-flow-graph (CFG),
which has one "basic block" per statement. We add the entry block to a
"worklist", which is processed roughly in FIFO order. After we process
a statement for the first time, we add its successors to the worklist.
In the case of loops, we may process the same statement again (or a
few times) -- in which case we only add the successor if the results
of the analysis have changed (aka we iterate until we reach a
"fixed-point").

This is all just like normal dataflow analysis. One twist we have is
that we use "conditional constant propagation". This means that if we
have something like:

if ($myvar) $x = 5; else $x = 6;

and we know that $myvar is 7, then we don't process the $x = 6, since
it can't occur.


At each "program point" (ie after each block) we store the complete
state of the analysis at this point. What exactly is stored depends on
the analysis. For constant-propagation, we store a "lattice" of
constants. That means that we know each variable is one of:
- TOP: the value is undefined
- BOTTOM: the variable may have multiple values
- or we know the exact constant that variable represents

With PHP, there is a slight difference from "normal" lattice based
analysis, in that TOP can represent NULL, since variables are
automatically initialized to NULL. When we find another value for $x,
we merge it with the current value. While (TOP + some value V => V),
(NULL + non-NULL => BOTTOM).

For type-inference, we use a set of type names for each variable. The
types are int, bool, unset, real, string, resource, array, and any
concrete classname. We don't have a BOTTOM (most conservative result),
because not knowing one type is the end of the analysis. In theory,
this means that in some unusual cases (like "new $classname.$num") the
analysis will only terminate by running out of memory. Oh well.

For constants (as in "define (MYCONST, 5);"), we store the value of
each constant so far in the program. We use the lattice approach from
constant-propagation, but unknown values only come up if you define
the same constant in different branches, which isn't too common I
think. We currently don't model these well enough to be able to
resolve a call to is_defined, but this wouldn't be too hard.

[PHP-DEV] Re: PHC Dataflow

2009-06-29 Thread Graham Kelly
Hey Paul,

As always, great information!

Anyway, I've been looking into building a fairly simple def-use system where I 
build a graph for each variable in the function local scope (each CV, TMP_VAR, 
and VAR).  What I'm doing is building a local graph for each variable in each 
basic block and then using the CFG information to connect these local graphs to 
form a function-wide def-use graph for each variable. The def-use system seems
like it should allow for some fairly straightforward work to be done with dead
code elimination, constant propagation, general usage information, loop 
induction identification, etc, etc. I figured this might be a bit easier to 
quickly develop than something more in-depth that looks like it might be able
to be adapted to handle some simple data type and constant propagation stuff.
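
To make the per-block part concrete, roughly what I'm picturing (the opcode
layout is simplified and the helper name is invented):

<?php
// Build local def-use chains for one basic block. Upward-exposed uses and the
// block's final defs are returned so the CFG pass can link blocks together.
function local_def_use(array $ops) {
    $last_def = array();  // var => opline of the most recent def in this block
    $chains   = array();  // array(def opline, use opline, var)
    $exposed  = array();  // uses with no def in this block (resolved via the CFG)
    foreach ($ops as $i => $op) {
        foreach ($op['uses'] as $var) {
            if (isset($last_def[$var])) {
                $chains[] = array($last_def[$var], $i, $var);
            } else {
                $exposed[] = array($i, $var);
            }
        }
        if (isset($op['def'])) {
            $last_def[$op['def']] = $i;
        }
    }
    return array($chains, $exposed, $last_def);
}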

Where I eventually want to go with this is to try to lock down as many data 
types as possible during optimization. Of course the biggest problem 
pecl/optimizer has with this is not being able to look at the entire program's 
structure and use that to infer function argument data types. But, I also feel 
that before such analysis could happen I need to have the function local data 
flow analysis down solid anyway and at that point I can worry about the unknown 
types of arguments issue. I also feel that data types alone won't really be
enough for PHP and tracking possible ranges for variables might be useful (this 
way for one you could better lock down the data types by for example being able 
to determine if a ZEND_ADD opcode will ever overflow into a double or if you 
can safely infer that it will always be a long).
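
(For example, the kind of overflow I mean; the exact float rendering depends
on the build:)

<?php
// ZEND_ADD silently promotes to a double once the result no longer fits in a long.
var_dump(PHP_INT_MAX + 1);  // float(...): the int overflowed
var_dump(2 + 3);            // int(5): provable only if both operand ranges are known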

Thanks,
Graham

On 6/29/09 11:04 AM, "Paul Biggar"  wrote:

Hey Graham,

Sorry for the delay, I meant to send this over the weekend.

On Thu, Jun 25, 2009 at 2:50 AM, Paul Biggar wrote:
>
> When I say model, I mean that we "abstractly" "interpret" the program.
> Its like we're interpreting it, but we don't have the actual values,
> so we model all possible values at a certain point. We do this using 3
> combined analyses: constant propagation, alias analysis, and type
> inference. We also model definition of constants and the callgraph. We
> don't currently (but plan to in the future) model integer- and
> string-value-ranges, included files, or "truth analysis" (though that
> would only take about an hour, I think).

(If you're already familiar with how compiler optimization works, the
first few paragraphs won't be anything new to you. I'll gloss over the
simple compiler terms, please email me if googling hasn't worked for
you; I'll keep explaining the advanced concepts).

So the abstract interpretation then. We analyse a representation
called the "Medium-level Intermediate Representation" (MIR), which is
pretty close to PHP bytecode. We start by invoking the first method,
__MAIN__, which represents the global scope.


Upon entering a method, we convert it into a control-flow-graph (CFG),
which has one "basic block" per statement. We add the entry block to a
"worklist", which is processed roughly in FIFO order. After we process
a statement for the first time, we add its successors to the worklist.
In the case of loops, we may process the same statement again (or a
few times) -- in which case we only add the successor if the results
of the analysis have changed (aka we iterate until we reach a
"fixed-point").

This is all just like normal dataflow analysis. One twist we have is
that we use "conditional constant propagation". This means that if we
have something like:

if ($myvar) $x = 5; else $x = 6;

and we know that $myvar is 7, then we don't process the $x = 6, since
it can't occur.


At each "program point" (ie after each block) we store the complete
state of the analysis at this point. What exactly is stored depends on
the analysis. For constant-propagation, we store a "lattice" of
constants. That means that we know each variable is one of:
- TOP: the value is undefined
- BOTTOM: the variable may have multiple values
- or we know the exact constant that variable represents

With PHP, there is a slight difference from "normal" lattice based
analysis, in that TOP can represent NULL, since variables are
automatically initialized to NULL. When we find another value for $x,
we merge it with the current value. While (TOP + some value V => V),
(NULL + non-NULL => BOTTOM).

For type-inference, we use a set of type names for each variable. The
types are int, bool, unset, real, string, resource, array, and any
concrete classname. We don't have a BOTTOM (most conservative result),
because not knowing one type is the end of the analysis. In theory,
this means that in some unusual cases (like "new $classname.$num") the
analysis will only terminate by running out of memory. Oh well.

For constants (as in "define (MYCONST, 5);"), we store the value of
each constant so far in the program. We use the lattice approach from
constant-propagation, but unknown values only come up if you define

[PHP-DEV] Re: PHC Dataflow

2009-06-29 Thread Paul Biggar
Hey Graham,

Sorry for the delay, I meant to send this over the weekend.

On Thu, Jun 25, 2009 at 2:50 AM, Paul Biggar wrote:
>
> When I say model, I mean that we "abstractly" "interpret" the program.
> Its like we're interpreting it, but we don't have the actual values,
> so we model all possible values at a certain point. We do this using 3
> combined analyses: constant propagation, alias analysis, and type
> inference. We also model definition of constants and the callgraph. We
> don't currently (but plan to in the future) model integer- and
> string-value-ranges, included files, or "truth analysis" (though that
> would only take about an hour, I think).

(If you're already familiar with how compiler optimization works, the
first few paragraphs won't be anything new to you. I'll gloss over the
simple compiler terms, please email me if googling hasn't worked for
you; I'll keep explaining the advanced concepts).

So the abstract interpretation then. We analyse a representation
called the "Medium-level Intermediate Representation" (MIR), which is
pretty close to PHP bytecode. We start by invoking the first method,
__MAIN__, which represents the global scope.


Upon entering a method, we convert it into a control-flow-graph (CFG),
which has one "basic block" per statement. We add the entry block to a
"worklist", which is processed roughly in FIFO order. After we process
a statement for the first time, we add its successors to the worklist.
In the case of loops, we may process the same statement again (or a
few times) -- in which case we only add the successor if the results
of the analysis have changed (aka we iterate until we reach a
"fixed-point").

This is all just like normal dataflow analysis. One twist we have is
that we use "conditional constant propagation". This means that if we
have something like:

if ($myvar) $x = 5; else $x = 6;

and we know that $myvar is 7, then we don't process the $x = 6, since
it can't occur.
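
Schematically, the branch handling looks like this (the helper and field
names are invented):

<?php
define('BOTTOM', '*bottom*');  // placeholder sentinel for "value unknown"

// Conditional constant propagation: only queue branch targets that can execute.
function handle_branch($op, $state, array &$worklist) {
    $cond = lattice_value($state, $op['cond']);  // TOP, BOTTOM, or a known constant
    if ($cond === BOTTOM) {                      // unknown: both arms are possible
        $worklist[] = $op['true_target'];
        $worklist[] = $op['false_target'];
    } elseif ($cond) {                           // e.g. $myvar known to be 7,
        $worklist[] = $op['true_target'];        // so "$x = 6" is never processed
    } else {
        $worklist[] = $op['false_target'];
    }
}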


At each "program point" (ie after each block) we store the complete
state of the analysis at this point. What exactly is stored depends on
the analysis. For constant-propagation, we store a "lattice" of
constants. That means that we know each variable is one of:
- TOP: the value is undefined
- BOTTOM: the variable may have multiple values
- or we know the exact constant that variable represents

With PHP, there is a slight difference from "normal" lattice based
analysis, in that TOP can represent NULL, since variables are
automatically initialized to NULL. When we find another value for $x,
we merge it with the current value. While (TOP + some value V => V),
(NULL + non-NULL => BOTTOM).
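
As a merge ("meet") function at join points, that is roughly this (TOP and
BOTTOM are just placeholder sentinels here; real code would use dedicated
objects):

<?php
define('TOP', '*top*');        // "undefined", which in PHP also covers the implicit NULL init
define('BOTTOM', '*bottom*');  // "may have multiple values"

// Merge two constant-lattice values where control-flow paths join.
function merge($a, $b) {
    if ($a === TOP) return $b;                      // TOP + V => V
    if ($b === TOP) return $a;
    if ($a === BOTTOM || $b === BOTTOM) return BOTTOM;
    return ($a === $b) ? $a : BOTTOM;               // e.g. NULL + 5 => BOTTOM
}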

For type-inference, we use a set of type names for each variable. The
types are int, bool, unset, real, string, resource, array, and any
concrete classname. We don't have a BOTTOM (most conservative result),
because not knowing one type is the end of the analysis. In theory,
this means that in some unusual cases (like "new $classname.$num") the
analysis will only terminate by running out of memory. Oh well.
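
The type-inference state merges the same way, except it is just set union and
there is no BOTTOM to fall back to (sketch):

<?php
// Each variable maps to a set of possible type names; joins take the union.
function merge_types(array $a, array $b) {
    return array_values(array_unique(array_merge($a, $b)));
    // e.g. merge_types(array('int'), array('string')) returns array('int', 'string')
}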

For constants (as in "define (MYCONST, 5);"), we store the value of
each constant so far in the program. We use the lattice approach from
constant-propagation, but unknown values only come up if you define
the same constant in different branches, which isn't too common I
think. We currently don't model these well enough to be able to
resolve a call to is_defined, but this wouldn't be too hard.

Finally, the really big and important analysis is alias analysis. I'll
go into some detail in a moment. We store at each program point a
complete points-to graph of the alias results, and use copy-on-write
to avoid a massive explosion in space and execution time (not very
successfully).


Alias analysis is the really important analysis. I mentioned before
that not knowing a single type will ruin the analysis. When I said
above that we stored a constant lattice and a set of types for each
variable, that wasn't exactly true. We actually store them for each
object, field, array, index, symbol table and variable. The alias
analysis allows us to tell the interaction between all of these.

You might notice that these three concepts are mostly the same: object
is-to field as array is-to index as symbol-table is-to variable. In
the Zend engine, these are all implemented using hashtables. In our
analysis, these are also all modelled the same way.

Our points-to graph is just a graph with 3 kinds of nodes:
 - "storage nodes" represent objects, arrays and symbol tables,
 - "index nodes" represent fields, variables and array indices,
 - "value nodes" represent scalar values of index nodes.

There are 3 kinds of edges:
 - "field edges" go from a storage node to an index node: for S -> I,
the object S may have a field named I.
 - "reference edges" go from index nodes to index nodes: for I1 -> I2,
the field I1 references the field I2.
 - "value edges" go from index nodes to storage or value nodes: for I
-> S, one of I'
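
In data-structure terms, a rough sketch of that graph might be (the class and
field names are invented for illustration):

<?php
// Objects, arrays and symbol tables all become storage nodes, so one
// representation covers all three "hashtable-like" things.
class StorageNode {
    public $fields = array();   // field edges: index name => IndexNode
}
class IndexNode {
    public $refs   = array();   // reference edges to other IndexNodes
    public $values = array();   // value edges to StorageNodes or ValueNodes
}
class ValueNode {
    public $lattice;            // constant-lattice value for this index
    public $types;              // set of possible type names
}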

Re: [PHP-DEV] Re: PHC Dataflow

2009-06-25 Thread Nuno Lopes

On Thu, Jun 25, 2009 at 9:08 PM, Nuno Lopes wrote:

So we do whole program analysis. We start at the first statement in
global scope, model it, then add all its successors to a queue, and
keep modelling until we get to the end of the program. When we get to
a method call, we start modelling the execution of the method. At the
end of the method, we propagate the results back to the caller, model
the return value, and carry on modelling the caller.


This is just an outline. I imagine this description can be applied to
a lot of compilers and a lot of analysis algorithms.


uhm, not many. Usually compilers consider each function independently, and 
do not consider the calling contexts.




it sounds like this algorithm:
Precise interprocedural dataflow analysis via graph reachability, POPL'95
http://portal.acm.org/citation.cfm?id=199462


I'm not familiar with this paper. However, from the abstract, it looks
like it wouldn't handle alias analysis (that's not to say it couldn't be
extended to do it). More importantly though, PHP's semantics are truly
special, so most analyses for "traditional" languages don't apply
well. Of course, many of the techniques do.


This paper describes a framework for dataflow analysis. It doesn't impose 
any such constraint. For example, I'm currently using this algorithm to 
build an analysis that takes aliasing into account.


Anyway, I didn't want to disturb this nice discussion :)  I just wanted to 
point you to this paper, which should be interesting for your thesis (to
help you formalize your analysis).


Nuno 



--
PHP Internals - PHP Runtime Development Mailing List
To unsubscribe, visit: http://www.php.net/unsub.php



Re: [PHP-DEV] Re: PHC Dataflow

2009-06-25 Thread Paul Biggar
Hi Nuno,

On Thu, Jun 25, 2009 at 9:08 PM, Nuno Lopes wrote:
>> So we do whole program analysis. We start at the first statement in
>> global scope, model it, then add all its successors to a queue, and
>> keep modelling until we get to the end of the program. When we get to
>> a method call, we start modelling the execution of the method. At the
>> end of the method, we propagate the results back to the caller, model
>> the return value, and carry on modelling the caller.

This is just an outline. I imagine this description can be applied to
a lot of compilers and a lot of analysis algorithms.


> it sounds like this algorithm:
> Precise interprocedural dataflow analysis via graph reachability, POPL'95
> http://portal.acm.org/citation.cfm?id=199462

I'm not familiar with this paper. However, from the abstract, it looks
like it wouldn't handle alias analysis (that's not to say it couldn't be
extended to do it). More importantly though, PHP's semantics are truly
special, so most analyses for "traditional" languages don't apply
well. Of course, many of the techniques do.

If you're looking for papers that describe the general approach, try
http://portal.acm.org/citation.cfm?id=178264 and
http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.52.2637.

(FYI, people without ACM access can nearly always find those papers by
googling the title).

Paul



>
> Nuno
>



-- 
Paul Biggar
paul.big...@gmail.com

-- 
PHP Internals - PHP Runtime Development Mailing List
To unsubscribe, visit: http://www.php.net/unsub.php



Re: [PHP-DEV] Re: PHC Dataflow

2009-06-25 Thread Nuno Lopes

So we do whole program analysis. We start at the first statement in
global scope, model it, then add all its successors to a queue, and
keep modelling until we get to the end of the program. When we get to
a method call, we start modelling the execution of the method. At the
end of the method, we propagate the results back to the caller, model
the return value, and carry on modelling the caller.


it sounds like this algorithm:
Precise interprocedural dataflow analysis via graph reachability, POPL'95
http://portal.acm.org/citation.cfm?id=199462

Nuno

--
PHP Internals - PHP Runtime Development Mailing List
To unsubscribe, visit: http://www.php.net/unsub.php



[PHP-DEV] Re: PHC Dataflow

2009-06-24 Thread Paul Biggar
Hi folks,

>> On Wed, Jun 24, 2009 at 11:42 PM, Graham Kelly
>> wrote:
>>> You mentioned a couple of weeks ago that you would be interested in
>>> further
>>> discussing dataflow for PHP. I briefly looked at the dataflow branch of
>>> PHC.
>>> It looks like you have been doing a LOT of work in this area. I would
>>> like
>>> to move the optimizer more towards this direction with the analysis and
>>> was
>>> wondering if maybe you could share some insights.

>> I would love a more in-depth description of the method you are using. As
>> for
>>
>> I have no problem with you CCing internals on it.

Following last week's start, this is a continuation of my discussion
with Graham Kelly about the design of the phc optimizer, so as to
better influence the PECL_optimizer. I've CCed PHP-internals since I
was asked to, and most people who would be interested are here. Please
let me know if it is too much noise, and we can take it elsewhere. I
would strongly welcome feedback.


So, I have a ton of information here. I'll try to organize it into a
few chunks, since there is way too much for one email. I'll start with
a quick intro and overview.

The phc optimizer will form the bulk of my PhD thesis (coming soon, I
swear). I've spent about 2 years on it, not including extra time I
spent on phc itself. The source for the optimizer is online
(http://code.google.com/p/phc/source/browse/#svn/branches/dataflow/src/optimize),
and there is some other information about phc too
(http://phpcompiler.org/documentation.html). The state of the
optimizer is that the prototype is almost finished. It handles most
things. I plan to finish the prototype in about a week, write a paper,
then write my thesis.


The primary assumption I made is that the optimizer should be
whole-program. I'll justify that in a moment. This means we need to
see the source of all the code for a PHP program. This means it needs
to be a "deployment-time" optimizer, meaning after you've installed
the program, set the configuration variables, installed the plugins,
etc. A change in the source (ie addition of extra plugins) means the
optimizer will need to run again. Fiddling with .ini settings may
require a recompile, depending on the setting. I don't think this
hurts most people who would be serious about optimization (say, people
who run APC).

The main reason for this assumption is that we are unable to get any
results at all using intra-procedural analysis (intra-procedural means
you look at one function at a time). Consider:

function myfunc ($x, $y) { ... }

Since we know nothing about the types of $x and $y, there is little we
can say about the function. It may be $x and $y are references to each
other (via call-time-pass-by-ref), and that modifying one will modify
the other. It may be that $x is an object, and that echoing it will
call the __toString handler, possibly changing the value of $y, $this,
etc.

 There are a few more cases like this, but the hardest is the case
where $x has object_handlers. In this case, many simple operations
might trigger bizarre effects. As a worst-case example, the "read"
handler can be invoked by a read to $x, changing variables in the
active-symbol-table (local scope), setting each of them to have a
read-handler. In this way, unknown results can propagate to callers,
callees, globals, etc, making all analysis pretty much worthless. This
is an extreme case, which you could probably ignore, but the other
examples above couldn't be ignored.


So we do whole program analysis. We start at the first statement in
global scope, model it, then add all its successors to a queue, and
keep modelling until we get to the end of the program. When we get to
a method call, we start modelling the execution of the method. At the
end of the method, we propagate the results back to the caller, model
the return value, and carry on modelling the caller.
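
In rough pseudo-PHP (all of the helper names are invented), the call handling
is:

<?php
// On reaching a call, analyse the callee with the caller's state, then fold
// the callee's exit state (return value, effects on globals and references)
// back into the caller before continuing.
function model_call($call, $caller_state) {
    $entry_state = bind_arguments($call, $caller_state);
    $exit_state  = analyse_method($call->target, $entry_state);
    return merge_back($caller_state, $exit_state, $call->result_var);
}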

(The technical names for this are "abstract interpretation" or
"symbolic execution", though we're kinda abusing both terms. Really,
this is pretty similar to dataflow analysis).

When I say model, I mean that we "abstractly" "interpret" the program.
It's like we're interpreting it, but we don't have the actual values,
so we model all possible values at a certain point. We do this using 3
combined analyses: constant propagation, alias analysis, and type
inference. We also model definition of constants and the callgraph. We
don't currently (but plan to in the future) model integer- and
string-value-ranges, included files, or "truth analysis" (though that
would only take about an hour, I think).

I think that's probably enough for now. I'll talk about those 3
analyses tomorrow. Questions and feedback are welcome (especially if I
didn't explain clearly enough - academic stuff can be tough to explain
to a lay audience).

Thanks,
Paul




-- 
Paul Biggar
paul.big...@gmail.com

-- 
PHP Internals - PHP Runtime Development Mailing List
To unsubscribe, visit: http://www.php.net/unsub.php