Hi folks,

>> On Wed, Jun 24, 2009 at 11:42 PM, Graham Kelly<grah...@facebook.com> wrote:
>>> You mentioned a couple of weeks ago that you would be interested in further
>>> discussing dataflow for PHP. I briefly looked at the dataflow branch of PHC.
>>> It looks like you have been doing a LOT of work in this area. I would like
>>> to move the optimizer more towards this direction with the analysis and was
>>> wondering if maybe you could share some insights.
>> I would love a more in-depth description of the method you are using. As for
>>
>> I have no problem with you CCing internals on it.

Following last week's start, this is a continuation of my discussion with Graham Kelly about the design of the phc optimizer, so as to better influence the PECL optimizer. I've CCed PHP-internals since I was asked to, and most of the people who would be interested are here. Please let me know if it is too much noise, and we can take it elsewhere. I would strongly welcome feedback.

So, I have a ton of information here. I'll try to organize it into a few chunks, since there is way too much for one email. I'll start with a quick intro and overview.

The phc optimizer will form the bulk of my PhD thesis (coming soon, I swear). I've spent about two years on it, not including the extra time I spent on phc itself. The source for the optimizer is online (http://code.google.com/p/phc/source/browse/#svn/branches/dataflow/src/optimize), and there is some other information about phc too (http://phpcompiler.org/documentation.html). The state of the optimizer is that the prototype is almost finished: it handles most things. I plan to finish the prototype in about a week, write a paper, then write my thesis.

The primary assumption I made is that the optimizer should be whole-program; I'll justify that in a moment. This means we need to see the source of all the code for a PHP program, so it needs to be a "deployment-time" optimizer, run after you've installed the program, set the configuration variables, installed the plugins, etc. A change in the source (i.e. the addition of extra plugins) means the optimizer will need to run again, and fiddling with .ini settings may require a recompile, depending on the setting. I don't think this hurts most people who would be serious about optimization (say, people who run APC).
The main reason for this assumption is that we are unable to get any results at all using intra-procedural analysis (intra-procedural means you look at one function at a time). Consider:

  function myfunc ($x, $y) { ... }

Since we know nothing about the types of $x and $y, there is little we can say about the function. It may be that $x and $y are references to each other (via call-time pass-by-ref), and that modifying one will modify the other. It may be that $x is an object, and that echoing it will call the __toString handler, possibly changing the value of $y, $this, etc.

There are a few more cases like this, but the hardest is the case where $x has object_handlers. In this case, many simple operations might trigger bizarre effects. As a worst-case example, the "read" handler can be invoked by a read of $x, changing variables in the active symbol table (local scope) and setting each of them to have a read handler. In this way, unknown results can propagate to callers, callees, globals, etc., making all analysis pretty much worthless. This is an extreme case, which you could probably ignore, but the other examples above can't be ignored.

So we do whole-program analysis. We start at the first statement in global scope, model it, then add all its successors to a queue, and keep modelling until we get to the end of the program. When we get to a method call, we start modelling the execution of the method. At the end of the method, we propagate the results back to the caller, model the return value, and carry on modelling the caller. (The technical names for this are "abstract interpretation" or "symbolic execution", though we're kinda abusing both terms. Really, this is pretty similar to dataflow analysis.)

When I say "model", I mean that we "abstractly" "interpret" the program. It's like we're interpreting it, but since we don't have the actual values, we model all possible values at each point.
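To make that queue-driven traversal concrete, here is a minimal sketch of the classic worklist fixpoint loop it describes. This is a hedged illustration in Python rather than phc's actual C++; the CFG shape, the `transfer` function, and the `merge` operator are all invented for the example, not phc's real data structures.

```python
# Hypothetical sketch of the worklist-style "abstract interpretation"
# loop described above. phc's real analysis combines aliasing, types
# and constants; here we track a single abstract value per node.

def analyse(cfg, entry, transfer, merge, initial):
    """cfg maps node -> list of successor nodes.

    Terminates as long as merge is monotone over a lattice of
    finite height (true for the toy lattice below).
    """
    state = {entry: initial}      # abstract state at each node's entry
    worklist = [entry]
    while worklist:
        node = worklist.pop()
        out_state = transfer(node, state[node])   # model this statement
        for succ in cfg.get(node, []):
            merged = merge(state.get(succ), out_state)
            if merged != state.get(succ):         # state changed: revisit
                state[succ] = merged
                worklist.append(succ)
    return state

# Toy usage: does $x hold a known constant at each program point?
cfg = {"a": ["b"], "b": ["c"], "c": []}
defs = {"a": 5, "b": None, "c": None}     # node "a" assigns $x = 5

def transfer(node, st):
    # A node either assigns a constant to $x or passes the state through.
    return defs[node] if defs[node] is not None else st

def merge(old, new):
    if old is None:                 # unvisited successor: take new state
        return new
    return old if old == new else "TOP"   # conflicting values: unknown

result = analyse(cfg, "a", transfer, merge, None)
# After the fixpoint, $x is known to be 5 at nodes "b" and "c".
```

The `merge`-then-compare step is what makes this a dataflow fixpoint rather than plain interpretation: a node is only revisited when something about its incoming abstract state actually changed.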
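"Modelling all possible values at a point" is usually done with a small lattice per variable. Below is the textbook constant-propagation lattice (BOTTOM / a concrete constant / TOP), again as a hedged Python sketch of the standard formulation, not phc's implementation.

```python
# Textbook constant-propagation lattice: BOTTOM (no information yet),
# a concrete constant, or TOP (conflicting values, i.e. unknown).
# Standard formulation for illustration only, not phc's actual code.

BOTTOM, TOP = object(), object()

def meet(a, b):
    """Combine two lattice values at a control-flow join."""
    if a is BOTTOM: return b
    if b is BOTTOM: return a
    return a if a == b else TOP

def eval_add(a, b):
    """Abstractly evaluate a + b: constant only if both sides are."""
    if a is BOTTOM or b is BOTTOM: return BOTTOM
    if a is TOP or b is TOP: return TOP
    return a + b

assert meet(3, 3) == 3       # same constant on both paths: still constant
assert meet(3, 4) is TOP     # e.g. $x = 3 in one branch, $x = 4 in the other
assert eval_add(2, 3) == 5   # both operands known: fold the addition
assert eval_add(2, TOP) is TOP
```

This is also why the aliasing cases above are so damaging: once a variable could be anything, its lattice value goes to TOP, and TOP swallows everything it is combined with.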
We do this using 3 combined analyses: constant propagation, alias analysis, and type inference. We also model the definition of constants and the callgraph. We don't currently model integer- and string-value ranges, included files, or "truth analysis", though we plan to in the future (and that last one would only take about an hour, I think).

I think that's probably enough for now. I'll talk about those 3 analyses tomorrow. Questions and feedback are welcome (especially if I didn't explain clearly enough -- academic stuff can be tough to explain to a lay audience).

Thanks,
Paul

--
Paul Biggar
paul.big...@gmail.com

--
PHP Internals - PHP Runtime Development Mailing List
To unsubscribe, visit: http://www.php.net/unsub.php