Hi folks,

>> On Wed, Jun 24, 2009 at 11:42 PM, Graham Kelly<grah...@facebook.com> wrote:
>>> You mentioned a couple of weeks ago that you would be interested in further
>>> discussing dataflow for PHP. I briefly looked at the dataflow branch of PHC.
>>> It looks like you have been doing a LOT of work in this area. I would like
>>> to move the optimizer more towards this direction with the analysis and was
>>> wondering if maybe you could share some insights.
>> I would love a more in-depth description of the method you are using. As for
>>
>> I have no problem with you CCing internals on it.

Following last week's start, this is a continuation of my discussion with Graham Kelly about the design of the phc optimizer, so as to better influence the PECL optimizer. I've CCed PHP-internals since I was asked to, and most of the people who would be interested are here. Please let me know if it is too much noise, and we can take it elsewhere. I would strongly welcome feedback.

So, I have a ton of information here. I'll try to organize it into a few chunks, since there is way too much for one email. I'll start with a quick intro and overview.

The phc optimizer will form the bulk of my PhD thesis (coming soon, I swear). I've spent about two years on it, not including the extra time I spent on phc itself. The source for the optimizer is online (http://code.google.com/p/phc/source/browse/#svn/branches/dataflow/src/optimize), and there is some other information about phc too (http://phpcompiler.org/documentation.html). The state of the optimizer is that the prototype is almost finished: it handles most things. I plan to finish the prototype in about a week, write a paper, then write my thesis.

The primary assumption I made is that the optimizer should be whole-program; I'll justify that in a moment. This means we need to see the source of all the code for a PHP program, so it needs to be a "deployment-time" optimizer, run after you've installed the program, set the configuration variables, installed the plugins, etc. A change in the source (i.e. the addition of extra plugins) means the optimizer will need to run again, and fiddling with .ini settings may require a recompile, depending on the setting. I don't think this hurts most people who would be serious about optimization (say, people who run APC).
The main reason for this assumption is that we are unable to get any results at all using intra-procedural analysis (intra-procedural means you look at one function at a time). Consider:

  function myfunc ($x, $y) { ... }

Since we know nothing about the types of $x and $y, there is little we can say about the function. It may be that $x and $y are references to each other (via call-time pass-by-ref), and that modifying one will modify the other. It may be that $x is an object, and that echoing it will call the __toString handler, possibly changing the value of $y, $this, etc.

There are a few more cases like this, but the hardest is the case where $x has object_handlers. In this case, many simple operations might trigger bizarre effects. As a worst-case example, the "read" handler can be invoked by a read of $x, changing variables in the active symbol table (local scope) and setting each of them to have a read handler. In this way, unknown results can propagate to callers, callees, globals, etc., making all analysis pretty much worthless. This is an extreme case, which you could probably ignore, but the other examples above can't be ignored.

So we do whole-program analysis. We start at the first statement in global scope, model it, then add all its successors to a queue, and keep modelling until we get to the end of the program. When we get to a method call, we start modelling the execution of the method. At the end of the method, we propagate the results back to the caller, model the return value, and carry on modelling the caller. (The technical names for this are "abstract interpretation" or "symbolic execution", though we're kinda abusing both terms. Really, this is pretty similar to dataflow analysis.)

When I say "model", I mean that we "abstractly" "interpret" the program. It's like we're interpreting it, but since we don't have the actual values, we model all possible values at each point.
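To make that queue-driven traversal concrete, here is a minimal sketch of the classic worklist fixpoint loop it describes. This is a hedged illustration in Python rather than phc's actual C++; the CFG shape, the `transfer` function, and the `merge` operator are all invented for the example, not phc's real data structures.

```python
# Hypothetical sketch of the worklist-style "abstract interpretation"
# loop described above. phc's real analysis combines aliasing, types
# and constants; here we track a single abstract value per node.

def analyse(cfg, entry, transfer, merge, initial):
    """cfg maps node -> list of successor nodes.

    Terminates as long as merge is monotone over a lattice of
    finite height (true for the toy lattice below).
    """
    state = {entry: initial}      # abstract state at each node's entry
    worklist = [entry]
    while worklist:
        node = worklist.pop()
        out_state = transfer(node, state[node])   # model this statement
        for succ in cfg.get(node, []):
            merged = merge(state.get(succ), out_state)
            if merged != state.get(succ):         # state changed: revisit
                state[succ] = merged
                worklist.append(succ)
    return state

# Toy usage: does $x hold a known constant at each program point?
cfg = {"a": ["b"], "b": ["c"], "c": []}
defs = {"a": 5, "b": None, "c": None}     # node "a" assigns $x = 5

def transfer(node, st):
    # A node either assigns a constant to $x or passes the state through.
    return defs[node] if defs[node] is not None else st

def merge(old, new):
    if old is None:                 # unvisited successor: take new state
        return new
    return old if old == new else "TOP"   # conflicting values: unknown

result = analyse(cfg, "a", transfer, merge, None)
# After the fixpoint, $x is known to be 5 at nodes "b" and "c".
```

The `merge`-then-compare step is what makes this a dataflow fixpoint rather than plain interpretation: a node is only revisited when something about its incoming abstract state actually changed.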
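"Modelling all possible values at a point" is usually done with a small lattice per variable. Below is the textbook constant-propagation lattice (BOTTOM / a concrete constant / TOP), again as a hedged Python sketch of the standard formulation, not phc's implementation.

```python
# Textbook constant-propagation lattice: BOTTOM (no information yet),
# a concrete constant, or TOP (conflicting values, i.e. unknown).
# Standard formulation for illustration only, not phc's actual code.

BOTTOM, TOP = object(), object()

def meet(a, b):
    """Combine two lattice values at a control-flow join."""
    if a is BOTTOM: return b
    if b is BOTTOM: return a
    return a if a == b else TOP

def eval_add(a, b):
    """Abstractly evaluate a + b: constant only if both sides are."""
    if a is BOTTOM or b is BOTTOM: return BOTTOM
    if a is TOP or b is TOP: return TOP
    return a + b

assert meet(3, 3) == 3       # same constant on both paths: still constant
assert meet(3, 4) is TOP     # e.g. $x = 3 in one branch, $x = 4 in the other
assert eval_add(2, 3) == 5   # both operands known: fold the addition
assert eval_add(2, TOP) is TOP
```

This is also why the aliasing cases above are so damaging: once a variable could be anything, its lattice value goes to TOP, and TOP swallows everything it is combined with.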
We do this using 3 combined analyses: constant propagation, alias analysis, and type inference. We also model the definition of constants and the callgraph. We don't currently model integer- and string-value ranges, included files, or "truth analysis", though we plan to in the future (and that last one would only take about an hour, I think).

I think that's probably enough for now. I'll talk about those 3 analyses tomorrow. Questions and feedback are welcome (especially if I didn't explain clearly enough -- academic stuff can be tough to explain to a lay audience).

Thanks,
Paul

--
Paul Biggar
paul.big...@gmail.com

--
PHP Internals - PHP Runtime Development Mailing List
To unsubscribe, visit: http://www.php.net/unsub.php