On 02/01/2013 05:12 PM, Ralf Karrenberg wrote:
> 1. An old frontend of AMD creates LLVM IR from the .cl file.
> The code still contains calls to get_global_id(), barrier(), etc.
Same with pocl (the pocl-build script does this).
> 2. libWFV transforms the kernel such that it works on a vector of the
> next W consecutive local ids of dimension 0 (*).
> The code still contains calls to get_global_id(), barrier(), etc.
OK, so it unconditionally expands all scalar instructions in the
function to their vector counterparts.
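I.e., in OpenCL C terms the widening would look roughly like this (an
illustrative sketch only; libWFV of course works on the LLVM IR and
handles divergent control flow with masks, which is not shown here):

  /* scalar kernel body, one work-item per invocation: */
  out[get_global_id(0)] = a[get_global_id(0)] + b[get_global_id(0)];

  /* after widening with W = 4 one invocation covers 4 consecutive
   * local ids of dimension 0 (vload4/vstore4 just for illustration): */
  size_t gid = get_global_id(0);
  vstore4(vload4(0, &a[gid]) + vload4(0, &b[gid]), 0, &out[gid]);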
> 3. a) If there are no barriers,
> we create a wrapper for the kernel which contains loops over the
> local ids (probably the way you also do it). Inside that loop
> nest, there's a single call to the old kernel that we inline.
We also add the loop, but there is no special case for kernels
without barriers at this point. Such a kernel is considered a kernel with
only one "parallel region". The beef is in the analysis which forms
these parallel regions over which the loops are created. The
basis of this is described in the OpenCL ASIP paper we published.
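For a barrier-free kernel the generated work-group function is then
roughly the following (a simplified sketch with made-up names; the real
pocl-workgroup output differs in details):

  /* the whole kernel body is a single parallel region wrapped in
   * work-item loops over the local id space: */
  for (size_t lz = 0; lz < local_size_z; ++lz)
    for (size_t ly = 0; ly < local_size_y; ++ly)
      for (size_t lx = 0; lx < local_size_x; ++lx) {
        /* inlined kernel body for work-item (lx, ly, lz) */
      }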
> b) If there are barriers,
> we create functions for every code block between barriers and call
> them from a "trampoline", a switch statement inside a while loop,
> where each case loops over the local ids and calls one of the
> "continuations", which returns the ID of the next case to execute.
> This was described in more detail in our CC'12 paper.
This approach seems a bit like the lightweight thread or the
setjmp/longjmp approach that, AFAIU, the AMD CPU OpenCL implementation
used (or used to use). It should be OK for vectorization, but longer BB
regions are nicer for VLIW, and jumps/calls are nice to avoid.
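If I read the description right, the trampoline is roughly like this
(my own sketch, all identifiers made up):

  enum { REGION_A, REGION_B, EXIT }; /* one id per barrier-delimited region */
  int next = REGION_A;
  while (next != EXIT) {
    switch (next) {
    case REGION_A:
      for (int wi = 0; wi < local_size; ++wi)
        next = continuation_A(wi, &live_values[wi]); /* returns next id */
      break;
    case REGION_B:
      for (int wi = 0; wi < local_size; ++wi)
        next = continuation_B(wi, &live_values[wi]);
      break;
    }
  }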
Anyways, I have to read your paper with more thought, it has been on
my reading list far too long. Unfortunately, neither I nor Carlos have
had time to work on the OpenCL kernel compilation issues as much as
I've wanted in the past years.
Anyways, pocl's approach is to statically produce parallel regions
that are as long as possible.
The problematic cases are barriers inside conditional blocks (including
loops). Carlos invented a solution to this problem called "barrier tail
replication", which also handles barriers inside conditional regions by
creating a sort of a "test case" replica of the region that decides
where the branch is directed. This relies on the OpenCL 1.2 specification,
which declares that a barrier must be encountered by all WIs or by none.
There's a publication pending on this.
Something like this (Carlos please add/correct if you still remember).
Example with 2xWI WG:
BB1;
if (P) {
  BB2;
  barrier();
}
Then replicating/serializing for 2xWI:
BB1 P -> BB2 -> BB1' -> BB2'  (WI1 took P, so the rest of them have to)
BB1 !P -> BB1'                (WI1 didn't, so the rest do not either)
A tail replication pass ensures each barrier is reachable from
only one preceding barrier, which makes the parallel region formation
(regions between barriers) feasible.
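In C-like pseudocode the serialized result for that 2xWI example would
be roughly this (just my reading of it; BB1'/BB2' are the replicas for
the second WI):

  BB1;          /* first WI runs the region before the branch */
  if (P) {      /* first WI decides; per the spec the others follow */
    BB2;
    BB1_prime;  /* replicated "tail" for the second WI */
    BB2_prime;
    /* barrier: both WIs have reached it, the next region follows */
  } else {
    BB1_prime;  /* first WI skipped the barrier, so the second does too */
  }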
> The code does not contain any barrier() calls anymore now.
> Note that with enabled WFV, one of the generated loops (*) has a loop
> increment of W instead of 1.
> 4. We replace calls to get_local_id() etc. by accesses to the induction
> variables (similar to pocl I guess).
Yes.
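I.e., roughly this (a sketch; the real pass does it on the IR level,
and lx is a made-up name for the dimension 0 induction variable):

  /* before, inside the inlined kernel body: */
  out[get_local_id(0)] = 2.0f * in[get_local_id(0)];

  /* after, using the work-item loop's induction variable: */
  out[lx] = 2.0f * in[lx];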
> (*) The dimension is up to some currently non-existent heuristic that could
> decide that dimension 1 or 2 would be better for vectorization.
My plan in the 'loopvec' approach was to leave this type of decision
to the LLVM vectorizer's cost metrics. Just produce nicely vectorizable
input for it to digest.
> Obviously, this is only possible if the number of local work items is larger
> than the vectorization factor W and a multiple of it, but this is the case for
> all relevant applications that I have seen so far.
>
> As you can see, most of the transformations are done by the OpenCL driver,
> only
> part 2) is where libWFV is used.
> So, I think it should be possible to just modify your current pipeline at two
> specific points: First, we have to invoke WFV on the kernel at a pretty early
I think this is all doable inside pocl-workgroup as its input is the
single kernel from Clang. Its opt command line performs the aggressive
inlining etc., and at that point it still has those interesting
function calls.
> stage (basically directly after the frontend). Then, when creating the local
> work item loops, make the one over dimension 0 increment by 4 every time, and
> inline the vectorized kernel instead of the original one (unless vectorization
> failed, of course ;) ).
OK, I have done a similar thing for our experimental processor
architecture based on the replication method. I do not produce vectors,
but assume each scalar instruction is executed in lockstep on
N lanes (the SIMD GPU-like approach).
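So with WFV enabled the dimension 0 loop of the wrapper would become
something like this (a sketch with made-up names; W is the
vectorization factor):

  for (size_t lx = 0; lx < local_size_x; lx += W) {
    /* inlined W-wide kernel body covering local ids lx .. lx + W - 1;
     * with a step of 1 and the scalar body if vectorization failed */
  }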
--
--Pekka