On Saturday, 17 August 2013 at 00:53:39 UTC, luminousone wrote:
> You can't mix cpu and gpu code, they must be separate.
H'okay, let's be clear here. When you say 'mix CPU and GPU code',
you mean you can't mix them physically in the compiled executable
for all currently extant cases. Those aren't the same claim, and I
agree with the narrower one. That said, this doesn't preclude
having CUDA-like behavior where small functions can be written
that don't violate the constraints of GPU code and simultaneously
have semantics that can be executed on the CPU, and where such
small functions are then allowed to be called from both CPU and
GPU code.
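To make that concrete, something like this (the attribute name is
totally made up; the idea mirrors CUDA's __host__ __device__):

@bothTargets  // hypothetical: "compile me for the CPU *and* the GPU"
float fma3(float a, float x, float y)
{
    // no allocation, no I/O, no virtual dispatch -- nothing that GPU
    // code forbids, so either backend can compile and call it
    return a * x + y;
}

Ordinary CPU code calls it like any other function; under this
proposal, kernel code would be allowed to call it too.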
> However this still has problems of the cpu having to generate
> CPU code from the contents of gpu{} code blocks, as the GPU is
> unable to allocate memory, so for example,
>
>     gpu{
>         auto resultGPU = dot(c, cGPU);
>     }
>
> likely either won't work, or generates an array allocation in
> cpu code before the gpu block is otherwise run.
I wouldn't be so negative with the 'won't work' bit, 'cuz frankly
the 'or' you wrote there is semantically like what OpenCL and
CUDA do anyway.
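That 'allocation in CPU code before the gpu block runs' is just the
host-side buffer setup both of those APIs already require. A rough
sketch of what such a lowering could look like (Gpu.allocate is a
made-up name standing in for the clCreateBuffer/cudaMalloc-style
host code):

// CPU side: hoist the allocation out of the gpu block; a dot
// product yields a single value, so one element is enough.
auto resultGPU = Gpu.allocate!float(1);
gpu{
    // GPU side: the kernel only ever writes into pre-allocated storage.
    resultGPU[0] = dot(c, cGPU);
}

Which is exactly the shape of an OpenCL or CUDA host program:
allocate on the host, launch, let the device fill it in.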
> Also how does that dot product function know the correct index
> range to run on? Are we assuming it knows based on the length
> of a? While the syntax
>
>     c[] = a[] * b[];
>
> is safe for this sort of call, a function is less safe to do
> this with; with function calls the range needs to be told to
> the function, and you would call this function without the
> gpu{} block as the function itself is marked.
>
>     auto resultGPU = dot$(0 .. returnLesser(cGPU.length, dGPU.length))(cGPU, dGPU);
I think it was mentioned earlier that there should be, much like
in OpenCL or CUDA, builtins or otherwise available symbols for
getting the global identifier of each work-item, the work-group
size, global size, etc.
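Something along these lines, with the symbols either mixed in or
just plain available (names hypothetical, modeled on OpenCL's
get_global_id/get_global_size):

@kernel void mul(float[] a, float[] b, float[] c)
{
    auto i = globalId(0);   // ~ get_global_id(0) in OpenCL
    if (i < c.length)       // guard in case the launch range overshoots
        c[i] = a[i] * b[i];
}

With that, the "how does it know the range" question mostly
dissolves: the launch supplies the range, and the kernel just asks
where in it it is.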
> Remember with GPUs you don't send instructions, you send whole
> programs, and the whole program must finish before you can move
> on to the next CPU instruction.
I disagree with the assumption that the CPU must wait for the GPU
while the GPU is executing. Blocking by default might be helpful
for sequencing GPU global memory against CPU operations, but it's
not a necessary behavior (see OpenCL and its, in my opinion,
really nice queuing mechanism).
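Just to illustrate what non-blocking launches could look like on
the D side (Gpu.enqueue and the event objects are made up here,
mirroring OpenCL's command queues and events):

auto ev1 = Gpu.enqueue!stepA(bufA);       // returns immediately
auto ev2 = Gpu.enqueue!stepB(bufB, ev1);  // ordered after ev1 on the device
doUnrelatedCpuWork();                     // CPU keeps going in the meantime
ev2.wait();                               // block only when the result is needed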
=== Another thing...
I'm with luminousone's suggestion for some manner of function
attribute, to the tune of several metric tonnes of chimes. Wind
chimes. I'm supporting this suggestion with at least a metric
tonne of wind chimes.
*This* (and some small number of helpers), rather than
straight-up dumping a new keyword and block type into the
language. I really don't think D *needs* to have this any lower
level than a library-based solution, because it already has the
tools to make it ridiculously more convenient than C/C++ (not
necessarily as convenient as CUDA with its totally separate nvcc
compiler, but a huge amount).
ex.
@kernel auto myFun(BufferT)(BufferT glbmem)
{
    // brings in the kernel keywords and whatnot depending on __FUNCTION__
    // (because mixins eval where they're mixed in)
    mixin KernelDefs;
    // ^ and that's just about all the syntactic noise, the rest uses mixed-in
    //   keywords and the glbmem object to define several expressions that
    //   effectively record the operations to be performed into the return type

    // assignment into global memory recovers the expression type in the glbmem.
    glbmem[glbid] += 4;

    // This assigns the *expression* glbmem[glbid] to val.
    auto val = glbmem[glbid];

    // Ignoring that this has a data race, this exemplifies recapturing the
    //   expression 'val' (glbmem[glbid]) in glbmem[glbid+1].
    glbmem[glbid+1] = val;

    return glbmem; ///< I lied about the syntactic noise. This is the last bit.
}
Now if you want to, you can at runtime create an OpenCL-code
string (for example) by passing a heavily metaprogrammed type in
as BufferT. The call ends up looking like this:
auto promisedFutureResult = Gpu.call!myFun(buffer);
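For concreteness, a toy version of such an expression-recording
BufferT might look like this (entirely hypothetical and drastically
simplified, just to show the shape of the trick; pretend glbid is
the string "get_global_id(0)" here):

import std.conv : to;

// Records operations as OpenCL source instead of touching memory.
struct ExprBuffer
{
    string src;                    // accumulated OpenCL kernel body

    // glbmem[glbid] as an rvalue yields the OpenCL expression text
    string opIndex(string idx)
    {
        return "glbmem[" ~ idx ~ "]";
    }

    // glbmem[glbid] += 4 appends "glbmem[get_global_id(0)] += 4;" to src
    void opIndexOpAssign(string op)(int rhs, string idx)
    {
        src ~= opIndex(idx) ~ " " ~ op ~ "= " ~ to!string(rhs) ~ ";\n";
    }
}

The real thing needs quite a bit more (opIndexAssign, expression
node types for things like glbid+1, and so on), but the principle
is the same: host-side evaluation never touches real memory, it
just builds up src for Gpu.call to hand to the OpenCL compiler.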
The kernel compilation (assuming OpenCL) is memoized, and the
promisedFutureResult is some asynchronous object that implements
concurrent programming's future (or something to that extent).
For convenience, let's say that it blocks on any read other than
some special poll/checking mechanism.
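Usage might then look like this (the poll and the blocking read are
placeholder names):

auto promisedFutureResult = Gpu.call!myFun(buffer);  // returns right away
doOtherCpuWorkInTheMeantime();
if (!promisedFutureResult.isReady)       // non-blocking poll
    doEvenMoreCpuWork();
auto result = promisedFutureResult.get;  // first real read blocks until done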
The constraints imposed on the kernel functions are general
enough that the same code can also be executed on the CPU: the
launching call ( Gpu.call!myFun(buffer) ) can, instead of using
an expression-buffer, just pass a normal array in and have the
proper result pop out, given some interaction between the
identifiers mixed in by KernelDefs and the launching caller (ex.
using a loop, as sketched below).
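A rough sketch of that CPU path (currentGlobalId is a made-up hook
that KernelDefs would map glbid to when no GPU is involved):

size_t currentGlobalId;  // hypothetical: what KernelDefs exposes as glbid on the CPU

// Hypothetical CPU-side launcher: no OpenCL involved at all.
auto cpuCall(alias kern, T)(T[] buffer)
{
    foreach (i; 0 .. buffer.length)
    {
        currentGlobalId = i;  // plays the role of get_global_id(0)
        kern(buffer);         // same kernel body, plain array this time
    }
    return buffer;
}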
With CTFE, this method *I think* can also generate the code at
compile time given the proper kind of
expression-type-recording-BufferT.
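i.e. something to the effect of (buildKernelSource being a
hypothetical helper; assigning to an enum forces CTFE):

// Bake the generated OpenCL source into the binary at compile time.
enum kernelSource = buildKernelSource!(myFun, ExprBuffer)();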
Again, though, this requires a significant amount of
metaprogramming, heavy abuse of auto, and... did I mention a
significant amount of metaprogramming? It's roughly the same
method I used to embed OpenCL code in a C++ project of mine
without writing a single line of OpenCL code, however, so I
*know* it's doable, likely even more so, in D.