This is somewhat nice, but without using a real compiler, the result
will still be just a toy, unless you employ hundreds of compiler
experts working full time on the project.

For instance, Wikipedia lists the following loop optimizations:
# loop interchange : These optimizations exchange inner loops with
outer loops. When the loop variables index into an array, such a
transformation can improve locality of reference, depending on the
array's layout. This is also known as loop permutation.

# loop splitting/loop peeling : Loop splitting attempts to simplify a
loop or eliminate dependencies by breaking it into multiple loops
which have the same bodies but iterate over different contiguous
portions of the index range. A useful special case is loop peeling,
which can simplify a loop with a problematic first iteration by
performing that iteration separately before entering the loop.

# loop fusion or loop combining : Another technique which attempts to
reduce loop overhead. When two adjacent loops would iterate the same
number of times (whether or not that number is known at compile time),
their bodies can be combined as long as they make no reference to each
other's data.

# loop fission or loop distribution : Loop fission attempts to break a
loop into multiple loops over the same index range but each taking
only a part of the loop's body. This can improve locality of
reference, both of the data being accessed in the loop and the code in
the loop's body.

# loop unrolling: Duplicates the body of the loop multiple times, in
order to decrease the number of times the loop condition is tested and
the number of jumps, which may degrade performance by impairing the
instruction pipeline. Completely unrolling a loop eliminates all
overhead (except multiple instruction fetches & increased program load
time), but requires that the number of iterations be known at compile
time (except in the case of JIT compilers). Care must also be taken to
ensure that multiple re-calculation of indexed variables is not a
greater overhead than advancing pointers within the original loop.

# loop unswitching : Unswitching moves a conditional inside a loop
outside of it by duplicating the loop's body, and placing a version of
it inside each of the if and else clauses of the conditional.

# loop inversion : This technique changes a standard while loop into a
do/while (a.k.a. repeat/until) loop wrapped in an if conditional,
reducing the number of jumps by two, for cases when the loop is
executed. Doing so duplicates the condition check (increasing the size
of the code) but is more efficient because jumps usually cause a
pipeline stall. Additionally, if the initial condition is known at
compile-time and is known to be side-effect-free, the if guard can be
skipped.

# loop-invariant code motion : If a quantity is computed inside a loop
during every iteration, and its value is the same for each iteration,
it can vastly improve efficiency to hoist it outside the loop and
compute its value just once before the loop begins. This is
particularly important with the address-calculation expressions
generated by loops over arrays. For correct implementation, this
technique must be used with loop inversion, because not all code is
safe to be hoisted outside the loop.

# loop reversal : Loop reversal reverses the order in which values are
assigned to the index variable. This is a subtle optimization which
can help eliminate dependencies and thus enable other optimizations.
Also, certain architectures utilise looping constructs at Assembly
language level that count in a single direction only (e.g.
decrement-jump-if-not-zero (DJNZ)).

# loop tiling/loop blocking : Loop tiling reorganizes a loop to
iterate over blocks of data sized to fit in the cache.

# loop skewing : Loop skewing takes a nested loop iterating over a
multidimensional array, where each iteration of the inner loop depends
on previous iterations, and rearranges its array accesses so that the
only dependencies are between iterations of the outer loop.
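
To make a few of the entries above concrete, here is a minimal
hand-written sketch in C of loop interchange, loop-invariant code
motion and loop unswitching, each as a before/after pair. The function
names are made up for illustration and this is not taken from LLVM or
Mesa; a real optimizer performs these rewrites on its IR after proving
they are legal, which is the hard part.

```c
#include <assert.h>
#include <stddef.h>

/* Loop interchange, before: the inner loop strides through memory,
 * touching elements `cols` apart in a row-major array. */
static int sum_strided(const int *a, size_t rows, size_t cols)
{
    int sum = 0;
    assert(a != NULL);
    for (size_t j = 0; j < cols; j++)
        for (size_t i = 0; i < rows; i++)
            sum += a[i * cols + j];
    return sum;
}

/* Loop interchange, after: the loops are swapped, so the inner loop
 * now walks consecutive addresses. */
static int sum_contiguous(const int *a, size_t rows, size_t cols)
{
    int sum = 0;
    assert(a != NULL);
    for (size_t i = 0; i < rows; i++)
        for (size_t j = 0; j < cols; j++)
            sum += a[i * cols + j];
    return sum;
}

/* Before: scale * bias is recomputed and use_bias is retested on
 * every iteration, although neither changes inside the loop. */
static void scale_naive(int *dst, const int *src, size_t n,
                        int scale, int bias, int use_bias)
{
    assert(dst != NULL && src != NULL);
    for (size_t i = 0; i < n; i++) {
        if (use_bias)
            dst[i] = src[i] * scale + scale * bias;
        else
            dst[i] = src[i] * scale;
    }
}

/* After loop-invariant code motion (offset is computed once, assuming
 * the product does not overflow) and loop unswitching (the branch is
 * moved outside and the loop body is duplicated under each arm). */
static void scale_transformed(int *dst, const int *src, size_t n,
                              int scale, int bias, int use_bias)
{
    const int offset = scale * bias;
    assert(dst != NULL && src != NULL);
    if (use_bias) {
        for (size_t i = 0; i < n; i++)
            dst[i] = src[i] * scale + offset;
    } else {
        for (size_t i = 0; i < n; i++)
            dst[i] = src[i] * scale;
    }
}
```

Each pair computes the same result; only the shape of the loops
differs. Doing this by hand for one function is easy, doing it
automatically and safely for arbitrary shader code is not.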

Good luck doing all this on TGSI (especially if the developer does not
have serious experience writing production compilers).

Also, this does not mention all the other optimizations and analyses
required to do the above well (likely another 10-20 things).

Using a real compiler (e.g. LLVM, but also gcc or Open64), those
optimizations are already implemented, or at least there is already a
team of experienced compiler developers who are working full time to
implement such optimizations, allowing you to then just turn them on
without having to do any of the work yourself.

Note all "X compiler is bad for VLIW or whatever GPU architecture"
objections are irrelevant, since almost all optimizations are totally
architecture independent.

Also note that we should support OpenCL/compute shaders (already
available for *3* years on e.g. nv50) and those *really* need a real
compiler (as in, something developed for years by a team of compiler
experts, and in wide use).
For instance, nVidia uses Open64 to compile CUDA programs, and then
feeds back the output (via PTX) to their ad-hoc code generator.

Note that unlike Mesa/Gallium, nVidia actually had a working shader
optimizer AND a large paid team, yet they still decided to at least
partially use Open64.

PathScale (who seems to mainly sell an Open64-based compiler for the
HPC market) might do some of this work (with a particular focus on a
CUDA replacement for nv50), but it's unclear whether this will turn
out to be generally useful (for all Gallium drivers, as opposed to
nv50-only) or not.
Also they plan to use Open64 and WHIRL, and it's unclear whether that
is as well designed for embedding, and as easy to understand and
customize, as LLVM is (please expand on this if you know about it).

Really, the current code generation situation is totally _embarrassing_
(r300 is probably one of the best here, having its own compiler, and
it doesn't even support loops, so you can imagine how good the other
drivers are), and ought to be fixed in a definitive fashion.

This is obviously not achievable if Mesa/Gallium contributors are
supposed to write the compiler optimization themselves, since clearly
there is not even enough manpower to support a relatively up-to-date
version of OpenGL or, say, to have drivers that can allocate and fence
GPU memory in a sensible and fast way, or implement hierarchical Z
buffers, or any of the other things expected from a decent driver,
that the Mesa drivers don't do.

In other words, state-of-the-art optimizing compilers are not
something one can just pop up and write himself from scratch, unless
he is interested and skilled at it, it is his main project AND he
manages to attract, or pays, a community of compiler experts to work
on it.

Since LLVM already works well, has a community of compiler experts
working on it, and is funded by companies such as Apple, there is no
chance of attracting such a community, especially for something
limited to the niche of compiling shaders.

And yes, LLVM->TGSI->LLVM is not entirely trivial, but it is doable
(obviously), and once you get past that initial hurdle, you get
all of LLVM's optimizations for free.
And the free work keeps coming with every commit to the llvm
repository, and you only have to do the minimal work of updating for
LLVM interface changes.
So you can just do nothing and after a few months you notice that your
driver is faster on very advanced games because a new LLVM
automatically improved the quality of your shaders without you even
knowing about it.

Not to mention that we could then at some point just get rid of TGSI,
use LLVM IR directly, and have each driver implement a normal backend
if possible.

The test for adequacy of a shader compiler is being able to say "yes,
this code is really good: I can't easily come up with any way to
improve it" when looking at the generated code for any example you can
find.

Any ad-hoc compiler will most likely immediately fail such a test, for
complex examples.

So, for a GSoC project, I'd kind of suggest:
(1) Adapt the gallivm/llvmpipe TGSI->LLVM converter to also generate
AoS code (i.e. RGBA vectors as opposed to RRRR, GGGG, etc.) if
possible or write one from scratch otherwise
(2) Write an LLVM->TGSI backend, restricted to programs without any control flow
(3) Make LLVM->TGSI always work (even with control flow and DDX/DDY)
(4) Hook up all useful LLVM optimizations

If there is still time/as followup (note that these are mostly complex
things, at most one/two might be doable in the timeframe)
(5) Do something about uniform-specific shader generation, and support
automatically generating "pre-shaders" for the CPU (using the
x86/x86-64 LLVM backends) for uniform-only computations
(6) Enhance LLVM to provide any missing optimization with a significant impact
(7) Convert existing drivers to LLVM backends, or have them expose
more functionality to the TGSI backend via TGSI extensions (or
currently unused features such as predicate support), and do
driver-specific stuff (e.g. scalarization for scalar architectures)
(8) Make sure shaders can be compiled using as large as possible a
subset of plain C/C++, as well as OpenCL (using clang), and add OpenCL
support to Mesa/Gallium (some of it already exists in external
(9) Compare with fglrx and nVidia libGL/cgc/nvopencc and improve
whatever is necessary to be equal to or better than them
(10) Talk with LLVM developers about good VLIW code generation for the
Radeons and to a lesser extent nv30/nv40 that need it, and find out
exactly what the problem is here, how it can be solved and who could
do the work
(11) Add Gallium support for nv10/nv20 and r100/r200 using the LLVM
DAG instruction selector to code-generate a fixed pipeline (Stephane
Marchesin tried this already, seems it is non-trivial but could be
made to work partially, and probably enough to get the Xorg state
tracker to work on all cards and get rid of all X drivers at some
point)
(12) Figure out if any other compilers (Open64, gcc, whatever) can be
useful as backends for some drivers
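
To illustrate the AoS versus SoA distinction from item (1), here is a
minimal sketch in plain C. The struct and function names are
hypothetical and this is not how gallivm actually represents values
(it works on LLVM vector types); it only shows the two layouts, "RGBA
vectors" versus "RRRR, GGGG, etc.", and the same per-pixel operation
written against each.

```c
#include <assert.h>
#include <stddef.h>

/* AoS: one vector per pixel, holding its four channels (RGBA). */
struct rgba {
    float r, g, b, a;
};

/* SoA: one vector per channel, each holding that channel for four
 * pixels (RRRR, GGGG, BBBB, AAAA). */
struct rgba_soa4 {
    float r[4], g[4], b[4], a[4];
};

/* Transpose four AoS pixels into the SoA layout. */
static void aos_to_soa4(const struct rgba *in, struct rgba_soa4 *out)
{
    assert(in != NULL && out != NULL);
    for (size_t i = 0; i < 4; i++) {
        out->r[i] = in[i].r;
        out->g[i] = in[i].g;
        out->b[i] = in[i].b;
        out->a[i] = in[i].a;
    }
}

/* Premultiply colour by alpha, AoS version: each pixel's alpha has to
 * be applied to that pixel's other three channels. */
static void premultiply_aos(struct rgba *px, size_t n)
{
    assert(px != NULL);
    for (size_t i = 0; i < n; i++) {
        px[i].r *= px[i].a;
        px[i].g *= px[i].a;
        px[i].b *= px[i].a;
    }
}

/* Same operation, SoA version: three element-wise multiplies of whole
 * channel vectors, which maps directly onto CPU SIMD lanes. */
static void premultiply_soa4(struct rgba_soa4 *v)
{
    assert(v != NULL);
    for (size_t i = 0; i < 4; i++) {
        v->r[i] *= v->a[i];
        v->g[i] *= v->a[i];
        v->b[i] *= v->a[i];
    }
}
```

The SoA form suits a CPU rasterizer processing several pixels at once,
while GPUs with native 4-component vector registers are closer to the
AoS form, hence the suggestion to generate both.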

Maybe I should propose to do it myself though, if that is still
possible, since everyone else seems afraid of it for some reason, and
it seems to me absolutely essential to have a chance of having usable
drivers (read: ones that don't look ridiculous compared to the
proprietary ones), especially in the long run for DirectX 11-level and
later games and software heavily using OpenCL/compute shaders and very
complex tessellation/vertex/geometry/fragment shaders.

Mesa3d-dev mailing list
