We basically have to follow these rules:

1. The range must be known prior to execution of a GPU code block.
2. The range cannot be changed during execution of a GPU code block.
3. Code blocks can only receive a single range; it can, however, be multidimensional.
4. Index keys used in a code block are immutable.
5. Code blocks can only use a single key (the GPU executes many instances in parallel, each with its own unique key).
6. Indexes are always an unsigned integer type.
7. OpenCL and CUDA have no access to global state.
8. GPU code blocks cannot allocate memory.
9. GPU code blocks cannot call CPU functions.
10. Atomics, though available on the GPU, are many times slower than on the CPU.
11. Separately running instances of the same code block on the GPU cannot have any interdependency on each other.
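
As a concrete reading of these rules, here is what a single instance of a conforming code block looks like written as ordinary D (purely illustrative; kernelBody is a made-up name, and on real hardware the GPU runtime would supply key):

// One instance of a rule-conforming code block, as plain D.
// The runtime supplies a unique unsigned key per instance (rules 5, 6);
// the body allocates nothing (rule 8), calls no CPU functions (rule 9),
// and instance i writes only c[i] (rule 11).
void kernelBody(size_t key, const(float)[] a, const(float)[] b, float[] c)
{
    c[key] = a[key] + b[key];
}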

Now if we are talking about HSA, or another similar setup, then a few of those rules don't apply or become fuzzy.

HSA does have limited access to global state; it can call CPU functions that are pure; and of course, because the CPU and GPU share the same virtual address space under HSA, most of memory is open for access.
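
D already has a pure attribute that expresses much of the guarantee such a CPU callee would need, which suggests a natural fit. For example (mapping pure to "callable from HSA kernels" is my speculation, not anything in the HSA spec):

// D's existing `pure` attribute already forbids reading or writing
// global mutable state, which is the property an HSA kernel would
// need from any CPU function it calls.
pure float scale(float x, float factor)
{
    return x * factor;  // no side effects, no global access
}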

HSA also manages memory via the hMMU, and there is no need for GPU memory management functions, as that is handled by the operating system and the video card drivers.

Basically, D would either need to opt out of legacy APIs such as OpenCL, CUDA, etc. (these are mostly tied to C/C++ anyway, and generally have ugly-as-sin syntax), or D would have to go the route of a full and safe GPU subset of features.

I don't think such a setup can be implemented simply as a library, as the GPU needs compiled source.
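
That is the problem with the library route today: the kernel ends up as a string compiled by the driver at runtime, which the D compiler never sees or type-checks. A rough sketch of what that looks like (the OpenCL C inside the string is standard; the surrounding D is just for illustration):

// With a library-only approach, the kernel is an opaque string
// handed to the driver at runtime -- invisible to D's type system.
enum addKernel = `
    __kernel void add(__global const float* a,
                      __global const float* b,
                      __global float* c)
    {
        size_t key = get_global_id(0);
        c[key] = a[key] + b[key];
    }
`;
// ...later handed off via clCreateProgramWithSource() and built
// with clBuildProgram() before it can run.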

If D were to implement GPGPU features, I would actually suggest starting by simply adding a microthreading function syntax, for example...

void example(aggregate in float[] a; key, in float[] b, out float[] c) {
    c[key] = a[key] + b[key];
}

By adding an aggregate keyword to the function, we can infer the range simply from the length of a[], without adding an extra set of brackets or anything similar.
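
Until such syntax exists, roughly the same ergonomics can be approximated with a template in current D, with the range likewise inferred from the first array's length (a hand-rolled sketch; microthread and its shape are my invention, and it runs on CPU threads, not the GPU):

import std.parallelism : parallel;
import std.range : iota;

// Stand-in for the proposed syntax: the iteration range is inferred
// from a.length, and the kernel body receives each key in parallel.
void microthread(alias kernel)(const float[] a, const float[] b, float[] c)
{
    foreach (key; parallel(iota(a.length)))
        kernel(key, a, b, c);
}

// Usage, mirroring the example above:
//   microthread!((key, a, b, c) { c[key] = a[key] + b[key]; })(a, b, c);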

This would make access to the GPU more generic. More importantly, because LLVM will support HSA, it removes the need to write the more complex support into dmd that OpenCL and CUDA would require; a few hints for the LLVM backend would be enough to generate the dual-bytecode ELF executables.
