On Saturday, 17 August 2013 at 00:53:39 UTC, luminousone wrote:
> You can't mix cpu and gpu code, they must be separate.
H'okay, let's be clear here. When you say 'mix CPU and GPU code',
you mean you can't mix them physically in the compiled executable
for all currently extant cases. Those aren't the same claim, and I
agree with the narrower one. That said, this doesn't preclude
having CUDA-like behavior where small functions can be written
that don't violate the constraints of GPU code and simultaneously
have semantics that can be executed on the CPU, and where such
small functions are then allowed to be called from both CPU and
GPU code.
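To make that concrete, something like this (the attribute name is
totally made up; the idea mirrors CUDA's __host__ __device__):

@bothTargets  // hypothetical: "compile me for the CPU *and* the GPU"
float fma3(float a, float x, float y)
{
    // no allocation, no I/O, no virtual dispatch -- nothing that GPU
    // code forbids, so either backend can compile and call it
    return a * x + y;
}

Ordinary CPU code calls it like any other function; under this
proposal, kernel code would be allowed to call it too.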
> However this still has problems of the cpu having to generate
> CPU code from the contents of gpu{} code blocks, as the GPU is
> unable to allocate memory, so for example,
>
>     gpu{
>         auto resultGPU = dot(c, cGPU);
>     }
>
> likely either won't work, or generates an array allocation in
> cpu code before the gpu block is otherwise run.
I wouldn't be so negative with the 'won't work' bit, 'cuz frankly
the 'or' you wrote there is semantically like what OpenCL and
CUDA do anyway.
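That 'allocation in CPU code before the gpu block runs' is just the
host-side buffer setup both of those APIs already require. A rough
sketch of what such a lowering could look like (Gpu.allocate is a
made-up name standing in for the clCreateBuffer/cudaMalloc-style
host code):

// CPU side: hoist the allocation out of the gpu block; a dot
// product yields a single value, so one element is enough.
auto resultGPU = Gpu.allocate!float(1);
gpu{
    // GPU side: the kernel only ever writes into pre-allocated storage.
    resultGPU[0] = dot(c, cGPU);
}

Which is exactly the shape of an OpenCL or CUDA host program:
allocate on the host, launch, let the device fill it in.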
> Also how does that dot product function know the correct index
> range to run on? Are we assuming it knows based on the length
> of a? While the syntax
>
>     c[] = a[] * b[];
>
> is safe for this sort of call, a function is less safe to do
> this with; with function calls the range needs to be told to
> the function, and you would call this function without the
> gpu{} block as the function itself is marked.
>
>     auto resultGPU = dot$(0 .. returnLesser(cGPU.length, dGPU.length))(cGPU, dGPU);
I think it was mentioned earlier that there should be, much like
in OpenCL or CUDA, builtins or otherwise available symbols for
getting the global identifier of each work-item, the work-group
size, global size, etc.
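Something along these lines, with the symbols either mixed in or
just plain available (names hypothetical, modeled on OpenCL's
get_global_id/get_global_size):

@kernel void mul(float[] a, float[] b, float[] c)
{
    auto i = globalId(0);   // ~ get_global_id(0) in OpenCL
    if (i < c.length)       // guard in case the launch range overshoots
        c[i] = a[i] * b[i];
}

With that, the "how does it know the range" question mostly
dissolves: the launch supplies the range, and the kernel just asks
where in it it is.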
> Remember with GPUs you don't send instructions, you send whole
> programs, and the whole program must finish before you can move
> on to the next CPU instruction.
I disagree with the assumption that the CPU must wait for the GPU
while the GPU is executing. Blocking by default might be helpful
for sequencing GPU global memory against CPU operations, but it's
not a necessary behavior (see OpenCL and its, in my opinion,
really nice queuing mechanism).
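Just to illustrate what non-blocking launches could look like on
the D side (Gpu.enqueue and the event objects are made up here,
mirroring OpenCL's command queues and events):

auto ev1 = Gpu.enqueue!stepA(bufA);       // returns immediately
auto ev2 = Gpu.enqueue!stepB(bufB, ev1);  // ordered after ev1 on the device
doUnrelatedCpuWork();                     // CPU keeps going in the meantime
ev2.wait();                               // block only when the result is needed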
=== Another thing...
I'm with luminousone's suggestion for some manner of function
attribute, to the tune of several metric tonnes of chimes. Wind
chimes. I'm supporting this suggestion with at least a metric
tonne of wind chimes.
*This* (and some small number of helpers), rather than
straight-up dumping a new keyword and block type into the
language. I really don't think D *needs* to have this any lower
level than a library-based solution, because it already has the
tools to make it ridiculously more convenient than C/C++ (not
necessarily as convenient as CUDA with its totally separate nvcc
compiler, but a huge amount).
ex.
@kernel auto myFun(BufferT)(BufferT glbmem)
{
    // brings in the kernel keywords and whatnot depending on __FUNCTION__
    // (because mixins eval where they're mixed in)
    mixin KernelDefs;
    // ^ and that's just about all the syntactic noise, the rest uses mixed-in
    //   keywords and the glbmem object to define several expressions that
    //   effectively record the operations to be performed into the return type

    // assignment into global memory recovers the expression type in the glbmem.
    glbmem[glbid] += 4;

    // This assigns the *expression* glbmem[glbid] to val.
    auto val = glbmem[glbid];

    // Ignoring that this has a data race, this exemplifies recapturing the
    //   expression 'val' (glbmem[glbid]) in glbmem[glbid+1].
    glbmem[glbid+1] = val;

    return glbmem; ///< I lied about the syntactic noise. This is the last bit.
}
Now if you want to, you can at runtime create an OpenCL-code
string (for example) by passing a heavily metaprogrammed type in
as BufferT. The call ends up looking like this:
auto promisedFutureResult = Gpu.call!myFun(buffer);
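For concreteness, a toy version of such an expression-recording
BufferT might look like this (entirely hypothetical and drastically
simplified, just to show the shape of the trick; pretend glbid is
the string "get_global_id(0)" here):

import std.conv : to;

// Records operations as OpenCL source instead of touching memory.
struct ExprBuffer
{
    string src;                    // accumulated OpenCL kernel body

    // glbmem[glbid] as an rvalue yields the OpenCL expression text
    string opIndex(string idx)
    {
        return "glbmem[" ~ idx ~ "]";
    }

    // glbmem[glbid] += 4 appends "glbmem[get_global_id(0)] += 4;" to src
    void opIndexOpAssign(string op)(int rhs, string idx)
    {
        src ~= opIndex(idx) ~ " " ~ op ~ "= " ~ to!string(rhs) ~ ";\n";
    }
}

The real thing needs quite a bit more (opIndexAssign, expression
node types for things like glbid+1, and so on), but the principle
is the same: host-side evaluation never touches real memory, it
just builds up src for Gpu.call to hand to the OpenCL compiler.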
The kernel compilation (assuming OpenCL) is memoized, and the
promisedFutureResult is some asynchronous object that implements
concurrent programming's future (or something to that extent).
For convenience, let's say that it blocks on any read other than
some special poll/checking mechanism.
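Usage might then look like this (the poll and the blocking read are
placeholder names):

auto promisedFutureResult = Gpu.call!myFun(buffer);  // returns right away
doOtherCpuWorkInTheMeantime();
if (!promisedFutureResult.isReady)       // non-blocking poll
    doEvenMoreCpuWork();
auto result = promisedFutureResult.get;  // first real read blocks until done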
The constraints imposed on the kernel functions are general
enough that the same code can also be executed on the CPU: the
launching call ( Gpu.call!myFun(buffer) ) can, instead of using
an expression-buffer, just pass a normal array in and have the
proper result pop out, given some interaction between the
identifiers mixed in by KernelDefs and the launching caller (ex.
using a loop, as sketched below).
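A rough sketch of that CPU path (currentGlobalId is a made-up hook
that KernelDefs would map glbid to when no GPU is involved):

size_t currentGlobalId;  // hypothetical: what KernelDefs exposes as glbid on the CPU

// Hypothetical CPU-side launcher: no OpenCL involved at all.
auto cpuCall(alias kern, T)(T[] buffer)
{
    foreach (i; 0 .. buffer.length)
    {
        currentGlobalId = i;  // plays the role of get_global_id(0)
        kern(buffer);         // same kernel body, plain array this time
    }
    return buffer;
}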
With CTFE, this method *I think* can also generate the code at
compile time given the proper kind of
expression-type-recording-BufferT.
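i.e. something to the effect of (buildKernelSource being a
hypothetical helper; assigning to an enum forces CTFE):

// Bake the generated OpenCL source into the binary at compile time.
enum kernelSource = buildKernelSource!(myFun, ExprBuffer)();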
Again, though, this requires a significant amount of
metaprogramming, heavy abuse of auto, and... did I mention a
significant amount of metaprogramming? It's roughly the same
method I used to embed OpenCL code in a C++ project of mine
without writing a single line of OpenCL code, however, so I
*know* it's doable, likely even more so, in D.