On Saturday, 17 August 2013 at 00:53:39 UTC, luminousone wrote:
> You can't mix cpu and gpu code, they must be separate.

H'okay, let's be clear here. When you say 'mix CPU and GPU code', you mean you can't mix them physically in the compiled executable for all currently extant cases. They aren't the same. I agree with that. That said, this doesn't preclude having CUDA-like behavior, where small functions can be written that don't violate the constraints of GPU code and simultaneously have semantics that could be executed on the CPU, and where such small functions are then allowed to be called from both CPU and GPU code.

> However this still has problems of the cpu having to generate CPU code from
> the contents of gpu{} code blocks, as the GPU is unable to allocate memory,
> so for example,
>
> gpu{
>     auto resultGPU = dot(c, cGPU);
> }
>
> likely either won't work, or generates an array allocation in cpu code
> before the gpu block is otherwise run.

I wouldn't be so negative with the 'won't work' bit, 'cuz frankly the 'or' you wrote there is semantically like what OpenCL and CUDA do anyway.

> Also how does that dot product function know the correct index range to run
> on? Are we assuming it knows based on the length of a? While the syntax
>
> c[] = a[] * b[];
>
> is safe for this sort of call, a function is less safe to do this with; with
> function calls the range needs to be told to the function, and you would
> call this function without the gpu{} block as the function itself is marked.
>
> auto resultGPU = dot$(0 .. returnLesser(cGPU.length, dGPU.length))(cGPU, dGPU);

I think it was mentioned earlier that there should be, much like in OpenCL or CUDA, builtins or otherwise available symbols for getting the global identifier of each work-item, the work-group size, global size, etc.

> Remember with gpu's you don't send instructions, you send whole programs, and
> the whole program must finish before you can move onto the next cpu
> instruction.

I disagree with the assumption that the CPU must wait for the GPU while the GPU is executing. Perhaps that could be the default behavior, since it is helpful for sequencing GPU global memory with CPU operations, but it's not a necessary behavior (see OpenCL and its, in my opinion, really nice queuing mechanism).
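To make the non-blocking model concrete, here's a minimal C++ sketch (C++ rather than OpenCL API calls, and `gpuDot` is an invented name; `std::async` merely stands in for a real device dispatch). The shape is the point: enqueue the work, keep doing CPU things, and only block when you actually read the result:

```cpp
#include <cassert>
#include <future>
#include <numeric>
#include <vector>

// Hypothetical "GPU" dispatch: returns immediately with a future; the
// caller blocks only when it reads the result, as with an OpenCL queue.
std::future<int> gpuDot(const std::vector<int>& a, const std::vector<int>& b) {
    return std::async(std::launch::async, [&a, &b] {
        // stand-in for the device-side dot product
        return std::inner_product(a.begin(), a.end(), b.begin(), 0);
    });
}
```

Usage looks like the promised-future call discussed further down: `auto fut = gpuDot(a, b);` returns right away, arbitrary CPU work proceeds, and `fut.get()` is the one blocking read.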

=== Another thing...

I'm with luminousone's suggestion for some manner of function attribute, to the tune of several metric tonnes of chimes. Wind chimes. I'm supporting this suggestion with at least a metric tonne of wind chimes.

*This* (and some small number of helpers), rather than straight-up dumping a new keyword and block type into the language. I really don't think D *needs* this any lower-level than a library-based solution, because it already has the tools to make it ridiculously more convenient than C/C++ (not necessarily as convenient as CUDA's totally separate nvcc compiler makes things, but by a huge amount).

ex.


@kernel auto myFun(BufferT)(BufferT glbmem)
{
    // Brings in the kernel keywords and whatnot, depending on __FUNCTION__
    // (because mixins eval where they're mixed in).
    mixin KernelDefs;
    // ^ And that's just about all the syntactic noise; the rest uses the
    // mixed-in keywords and the glbmem object to define several expressions
    // that effectively record the operations to be performed into the
    // return type.

    // Assignment into global memory recovers the expression type in the glbmem.
    glbmem[glbid] += 4;

    // This assigns the *expression* glbmem[glbid] to val.
    auto val = glbmem[glbid];

    // Ignoring that this has a data race, this exemplifies recapturing the
    // expression 'val' (glbmem[glbid]) in glbmem[glbid+1].
    glbmem[glbid + 1] = val;

    return glbmem; ///< I lied about the syntactic noise. This is the last bit.
}


Now if you want to, you can at runtime create an OpenCL-code string (for example) by passing a heavily metaprogrammed type in as BufferT. The call ends up looking like this:


auto promisedFutureResult = Gpu.call!myFun(buffer);


The kernel compilation (assuming OpenCL) is memoized, and the promisedFutureResult is some asynchronous object that implements concurrent programming's future (or something to that extent). For convenience, let's say that it blocks on any read other than some special poll/checking mechanism.
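To sketch the recording trick itself in C++ terms (all names here, `RecordingBuffer`, `GlobalId`, `spell`, are invented for illustration, not any real API): indexing the buffer returns an expression object, and assignments append OpenCL-like source text instead of doing arithmetic, so running the kernel template produces a code string:

```cpp
#include <cassert>
#include <sstream>
#include <string>

struct GlobalId {};  // hypothetical stand-in for the mixed-in `glbid`

std::string spell(GlobalId) { return "get_global_id(0)"; }
std::string spell(const std::string& s) { return s; }

// `glbid + 1` spells itself as source text rather than computing.
std::string operator+(GlobalId, int k) {
    return "get_global_id(0) + " + std::to_string(k);
}

// RecordingBuffer: operator[] yields an expression object; assignments
// append OpenCL-like statements to `body` instead of touching memory.
struct RecordingBuffer {
    std::ostringstream body;

    struct Expr {
        RecordingBuffer* owner;
        std::string text;  // source spelling, e.g. "glbmem[get_global_id(0)]"

        void operator+=(int v) {
            owner->body << text << " += " << v << ";\n";
        }
        Expr& operator=(const Expr& rhs) {
            owner->body << text << " = " << rhs.text << ";\n";
            return *this;
        }
    };

    template <class Idx>
    Expr operator[](const Idx& i) {
        return Expr{this, "glbmem[" + spell(i) + "]"};
    }
};

// The kernel is an ordinary function template, mirroring myFun above; with
// RecordingBuffer it *records* its operations rather than executing them.
template <class BufferT, class IdT>
BufferT& myFun(BufferT& glbmem, IdT glbid) {
    glbmem[glbid] += 4;
    auto val = glbmem[glbid];
    glbmem[glbid + 1] = val;
    return glbmem;
}
```

Calling `myFun(buf, GlobalId{})` on a `RecordingBuffer buf` leaves the statements `glbmem[get_global_id(0)] += 4;` and `glbmem[get_global_id(0) + 1] = glbmem[get_global_id(0)];` in `buf.body`, ready to be wrapped in a kernel signature and handed to the OpenCL compiler.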

The constraints imposed on the kernel functions are generalizable enough to even execute the code on the CPU, as the launching call ( Gpu.call!myFun(buffer) ) can, instead of using an expression-buffer, just pass a normal array in and have the proper result pop out, given some interaction between the identifiers mixed in by KernelDefs and the launching caller (ex. using a loop).
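Again in C++ terms, and with the launcher name `cpuCall` invented for illustration: the exact same kernel template, given a plain vector and an integer id, simply computes, and the GPU dispatch degrades into a loop over work-item ids:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Same kernel shape as the D sketch above; with a plain vector and an
// integer index it computes instead of recording.
template <class BufferT, class IdT>
BufferT& myFun(BufferT& glbmem, IdT glbid) {
    glbmem[glbid] += 4;
    auto val = glbmem[glbid];
    glbmem[glbid + 1] = val;
    return glbmem;
}

// Hypothetical CPU launcher: the device dispatch becomes a plain loop over
// work-item ids. Running sequentially also sidesteps the noted data race.
void cpuCall(std::vector<int>& buffer) {
    for (std::size_t gid = 0; gid + 1 < buffer.size(); ++gid)
        myFun(buffer, gid);
}
```

So `cpuCall` on `{0, 0, 0, 0}` yields `{4, 8, 12, 12}`: each sequential work-item adds 4 to its own slot and copies the result into the next one.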

With CTFE, this method *I think* can also generate the code at compile time given the proper kind of expression-type-recording-BufferT.

Again, though, this requires a significant amount of metaprogramming, heavy abuse of auto, and... did I mention a significant amount of metaprogramming? It's roughly the same method I used to embed OpenCL code in a C++ project of mine without writing a single line of OpenCL code, so I *know* it's doable, and likely even more cleanly, in D.
