This is somewhat nice, but without using a real compiler, the result will still be just a toy, unless you employ hundreds of compiler experts working full time on the project.
For instance, Wikipedia lists the following loop optimizations:

# loop interchange : These optimizations exchange inner loops with outer loops. When the loop variables index into an array, such a transformation can improve locality of reference, depending on the array's layout. This is also known as loop permutation.

# loop splitting/loop peeling : Loop splitting attempts to simplify a loop or eliminate dependencies by breaking it into multiple loops which have the same bodies but iterate over different contiguous portions of the index range. A useful special case is loop peeling, which can simplify a loop with a problematic first iteration by performing that iteration separately before entering the loop.

# loop fusion or loop combining : Another technique which attempts to reduce loop overhead. When two adjacent loops would iterate the same number of times (whether or not that number is known at compile time), their bodies can be combined as long as they make no reference to each other's data.

# loop fission or loop distribution : Loop fission attempts to break a loop into multiple loops over the same index range but each taking only a part of the loop's body. This can improve locality of reference, both of the data being accessed in the loop and the code in the loop's body.

# loop unrolling : Duplicates the body of the loop multiple times, in order to decrease the number of times the loop condition is tested and the number of jumps, which may degrade performance by impairing the instruction pipeline. Completely unrolling a loop eliminates all overhead (except multiple instruction fetches & increased program load time), but requires that the number of iterations be known at compile time (except in the case of JIT compilers). Care must also be taken to ensure that multiple re-calculation of indexed variables is not a greater overhead than advancing pointers within the original loop.

# loop unswitching : Unswitching moves a conditional inside a loop outside of it by duplicating the loop's body, and placing a version of it inside each of the if and else clauses of the conditional.

# loop inversion : This technique changes a standard while loop into a do/while (a.k.a. repeat/until) loop wrapped in an if conditional, reducing the number of jumps by two, for cases when the loop is executed. Doing so duplicates the condition check (increasing the size of the code) but is more efficient because jumps usually cause a pipeline stall. Additionally, if the initial condition is known at compile-time and is known to be side-effect-free, the if guard can be skipped.

# loop-invariant code motion : If a quantity is computed inside a loop during every iteration, and its value is the same for each iteration, it can vastly improve efficiency to hoist it outside the loop and compute its value just once before the loop begins. This is particularly important with the address-calculation expressions generated by loops over arrays. For correct implementation, this technique must be used with loop inversion, because not all code is safe to be hoisted outside the loop.

# loop reversal : Loop reversal reverses the order in which values are assigned to the index variable. This is a subtle optimization which can help eliminate dependencies and thus enable other optimizations. Also, certain architectures utilise looping constructs at Assembly language level that count in a single direction only (e.g. decrement-jump-if-not-zero (DJNZ)).

# loop tiling/loop blocking : Loop tiling reorganizes a loop to iterate over blocks of data sized to fit in the cache.

# loop skewing : Loop skewing takes a nested loop iterating over a multidimensional array, where each iteration of the inner loop depends on previous iterations, and rearranges its array accesses so that the only dependencies are between iterations of the outer loop.
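
To make two of the entries above concrete, here is a minimal hand-written sketch in C of loop-invariant code motion combined with loop unswitching. The function names (scale_naive, scale_optimized) and the integer data are made up for illustration and are not taken from Mesa or LLVM; the point is only that an optimizing compiler would be expected to derive something like the second form from the first without help.

```c
#include <assert.h>
#include <stddef.h>

/* Naive form: scale * bias is recomputed on every iteration and the
 * use_bias test sits inside the loop body.
 * Unsigned integers are used so that results compare exactly and so that
 * the hoisted multiplication below is well defined even in cases where
 * the original loop body would never have executed it. */
void scale_naive(unsigned *dst, const unsigned *src, size_t n,
                 unsigned scale, unsigned bias, int use_bias)
{
    assert(n == 0 || (dst != NULL && src != NULL));
    for (size_t i = 0; i < n; i++) {
        if (use_bias)
            dst[i] = src[i] * scale + scale * bias;
        else
            dst[i] = src[i] * scale;
    }
}

/* Same result after applying two transformations by hand. */
void scale_optimized(unsigned *dst, const unsigned *src, size_t n,
                     unsigned scale, unsigned bias, int use_bias)
{
    assert(n == 0 || (dst != NULL && src != NULL));

    /* Loop-invariant code motion: scale * bias does not depend on i,
     * so it is computed once before the loop. */
    const unsigned offset = scale * bias;

    /* Loop unswitching: use_bias does not change inside the loop, so the
     * test is moved outside and the loop is duplicated in each branch. */
    if (use_bias) {
        for (size_t i = 0; i < n; i++)
            dst[i] = src[i] * scale + offset;
    } else {
        for (size_t i = 0; i < n; i++)
            dst[i] = src[i] * scale;
    }
}
```

Even this tiny case shows why hoisting needs care: the hoisted multiplication now runs when n is 0 or use_bias is false, which is harmless here only because unsigned arithmetic cannot trap or overflow into undefined behaviour.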
Good luck doing all this on TGSI (especially if the developer does not have serious experience writing production compilers). Also, this does not mention all the other optimizations and analyses required to do the above stuff well (likely another 10-20 things).

With a real compiler (e.g. LLVM, but also gcc or Open64), those optimizations are already implemented, or at least there is already a team of experienced compiler developers working full time to implement them, allowing you to just turn them on without having to do any of the work yourself.

Note that all "X compiler is bad for VLIW or whatever GPU architecture" objections are irrelevant, since almost all optimizations are totally architecture independent.

Also note that we should support OpenCL/compute shaders (already available for *3* years on e.g. nv50) and those *really* need a real compiler (as in, something developed for years by a team of compiler experts, and in wide use).

For instance, nVidia uses Open64 to compile CUDA programs, and then feeds the output (via PTX) to their ad-hoc code generator. Note that unlike Mesa/Gallium, nVidia actually had a working shader optimizer AND a large paid team, yet they still decided to at least partially use Open64.

PathScale (who seem to mainly sell an Open64-based compiler for the HPC market) might do some of this work (with a particular focus on a CUDA replacement for nv50), but it's unclear whether this will turn out to be generally useful (for all Gallium drivers, as opposed to nv50-only) or not.
Also they plan to use Open64 and WHIRL, and it's unclear whether this is as well designed for embedding, and as easy to understand and customize, as LLVM is (please expand on this if you know about it).

Really, the current code generation situation is totally _embarrassing_ (and r300 is probably one of the best here, having its own compiler, and it doesn't even have loops, so you can imagine how good the other drivers are), and ought to be fixed in a definitive fashion.

This is obviously not achievable if Mesa/Gallium contributors are supposed to write the compiler optimizations themselves, since clearly there is not even enough manpower to support a relatively up-to-date version of OpenGL or, say, to have drivers that can allocate and fence GPU memory in a sensible and fast way, or implement hierarchical Z buffers, or any of the other things expected from a decent driver that the Mesa drivers don't do.

In other words, a state-of-the-art optimizing compiler is not something one can just pop up and write himself from scratch, unless he is interested in and skilled at it, it is his main project AND he manages to attract, or pay, a community of compiler experts to work on it. Since LLVM already works well, has a community of compiler experts working on it, and is funded by companies such as Apple, there is no chance of attracting such a community to a new compiler, especially for something limited to the niche of compiling shaders.

And yes, TGSI->LLVM->TGSI is not entirely trivial, but it is doable (obviously), and once you get past that initial hurdle, you get EVERYTHING FOR FREE. And the free work keeps coming with every commit to the llvm repository, and you only have to do the minimal work of updating for LLVM interface changes. So you can just do nothing and after a few months you notice that your driver is faster on very advanced games because a new LLVM automatically improved the quality of your shaders without you even knowing about it.
Not to mention that we could then at some point just get rid of TGSI, use LLVM IR directly, and have each driver implement a normal backend if possible.

The test for adequacy of a shader compiler is being able to say "yes, this code is really good: I can't easily come up with any way to improve it" when looking at the generated code for any example you can find. Any ad-hoc compiler will most likely immediately fail such a test on complex examples.

So, for a GSoC project, I'd kind of suggest:

(1) Adapt the gallivm/llvmpipe TGSI->LLVM converter to also generate AoS code (i.e. RGBA vectors as opposed to RRRR, GGGG, etc.) if possible, or write one from scratch otherwise
(2) Write an LLVM->TGSI backend, restricted to programs without any control flow
(3) Make LLVM->TGSI always work (even with control flow and DDX/DDY)
(4) Hook up all useful LLVM optimizations

If there is still time/as followup (note that these are mostly complex things, at most one or two might be doable in the timeframe):

(5) Do something about uniform-specific shader generation, and support automatically generating "pre-shaders" for the CPU (using the x86/x86-64 LLVM backends) for uniform-only computations
(6) Enhance LLVM to provide any missing optimization with a significant impact
(7) Convert existing drivers to LLVM backends, or have them expose more functionality to the TGSI backend via TGSI extensions (or currently unused features such as predicate support), and do driver-specific stuff (e.g. scalarization for scalar architectures)
(8) Make sure shaders can be compiled using as large as possible a subset of plain C/C++, as well as OpenCL (using clang), and add OpenCL support to Mesa/Gallium (some of it already exists in external repositories)
(9) Compare with fglrx and nVidia libGL/cgc/nvopencc and improve whatever is necessary to be equal to or better than them
(10) Talk with LLVM developers about good VLIW code generation for the Radeons, and to a lesser extent the nv30/nv40 that need it, and find out exactly what the problem is here, how it can be solved and who could do the work
(11) Add Gallium support for nv10/nv20 and r100/r200 using the LLVM DAG instruction selector to code-generate a fixed pipeline (Stephane Marchesin tried this already; it seems it is non-trivial but could be made to work partially, and probably enough to get the Xorg state tracker to work on all cards and get rid of all X drivers at some point)
(12) Figure out if any other compilers (Open64, gcc, whatever) can be useful as backends for some drivers

Maybe I should propose to do it myself though, if that is still possible, since everyone else seems afraid of it for some reason. It seems to me that it is absolutely essential if we are to have a chance of having usable drivers (read: ones that don't look ridiculous compared to the proprietary ones), especially in the long run for DirectX 11-level and later games, and for software heavily using OpenCL/compute shaders and very complex tessellation/vertex/geometry/fragment shaders.
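
As a footnote to item (1) in the list above: the difference between the SoA layout (RRRR, GGGG, etc.) and the AoS layout (one RGBA vector per pixel) can be sketched in plain C. The type and function names below (rgba_aos, rgba_soa, aos_to_soa, soa_to_aos) and the group size of 4 pixels are assumptions made for illustration only; this is not gallivm/llvmpipe code.

```c
#include <assert.h>
#include <stddef.h>

#define PIXELS 4  /* hypothetical group size: 4 pixels per SoA vector */

/* AoS: each pixel is one 4-component vector, c[0]=R, c[1]=G, c[2]=B, c[3]=A. */
struct rgba_aos { float c[4]; };

/* SoA: each channel is one vector holding that channel for all pixels,
 * i.e. ch[0]=RRRR, ch[1]=GGGG, ch[2]=BBBB, ch[3]=AAAA. */
struct rgba_soa { float ch[4][PIXELS]; };

/* Gather the k-th component of every pixel into channel vector k. */
void aos_to_soa(struct rgba_soa *out, const struct rgba_aos in[PIXELS])
{
    assert(out != NULL && in != NULL);
    for (size_t p = 0; p < PIXELS; p++)
        for (size_t k = 0; k < 4; k++)
            out->ch[k][p] = in[p].c[k];
}

/* Inverse: scatter each channel vector back into per-pixel RGBA vectors. */
void soa_to_aos(struct rgba_aos out[PIXELS], const struct rgba_soa *in)
{
    assert(out != NULL && in != NULL);
    for (size_t p = 0; p < PIXELS; p++)
        for (size_t k = 0; k < 4; k++)
            out[p].c[k] = in->ch[k][p];
}
```

With 4 pixels per group the conversion is just a 4x4 transpose, so going back and forth between the two layouts loses nothing.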
_______________________________________________
Mesa3d-dev mailing list
Mesa3d-dev@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/mesa3d-dev