On Sunday, 18 August 2013 at 01:43:33 UTC, Atash wrote:
A unified virtual address-space I can accept, fine. What I'm far,
*far* more iffy about is ignoring that it is, in fact, a totally
different address-space where memory latency is *entirely
different*.
We basically have to follow these rules,
1. The range must be none prior to execution of a gpu code block
2. The range can not be changed during execution of a gpu code block
3. Code blocks can only receive a single range; it can however be multidimensional
4. index keys used in a code block are immutable
5. Code blocks can only use a single key (the gpu executes many instances in parallel, each with their own unique key)
6. indices are always an unsigned integer type
7. openCL/CUDA code blocks have no access to global state
8. gpu code blocks can not allocate memory
9. gpu code blocks can not call cpu functions
10. atomics, though available on the gpu, are many times slower than on the cpu
11. separate running instances of the same code block on the gpu can not have any interdependency on each other.
Please explain point 1 (specifically the use of the word
'none'), and why you added point 3?
Additionally, point 11 doesn't make any sense to me. There is
research out there showing how to use cooperative warp-scans,
for example, to have multiple work-items cooperate over some
local block of memory and perform sorting in blocks. There are
even tutorials out there for OpenCL and CUDA that show how to
do this, specifically to create better-performing code. This
statement is in direct contradiction with what exists.
You do have limited atomics, but you don't really have any sort
of complex message passing, or anything like that.
Now if we are talking about HSA, or another similar setup, then
a few of those rules don't apply or become fuzzy.
HSA does have limited access to global state, HSA can call
cpu functions that are pure, and of course, because in HSA the
cpu and gpu share the same virtual address space, most
memory is open for access.
HSA also manages memory via the hMMU, and there is no need
for gpu memory management functions, as that is managed by the
operating system and video card drivers.
Good for HSA. Now why are we latching onto this particular
construction that, as far as I can tell, is missing the support
of at least two highly relevant giants (Intel and NVidia)?
Intel doesn't have a dog in this race, so there is no way to know
what they plan on doing, if anything at all.
The reason to point out HSA is that it is really easy to add
support for; it is not a giant task like openCL would be. A few
changes to the front-end compiler are all that is needed; LLVM's
backend does the rest.
Basically, D would either need to opt out of legacy APIs such
as openCL, CUDA, etc. (these are mostly tied to c/c++ anyway,
and generally have ugly-as-sin syntax), or D would have to go
the route of a full and safe gpu subset of features.
Wrappers do a lot to change the appearance of a program. Raw
OpenCL may look ugly, but so do BLAS and LAPACK routines. The
use of wrappers and expression templates does a lot to clean up
code (e.g. look at the way Eigen 3 or any other linear algebra
library does expression templates in C++; something D can do
even better).
I don't think such a setup can be implemented as simply a
library, as the GPU needs compiled source.
This doesn't make sense. Your claim is contingent on opting out
of OpenCL or any other mechanism that provides for the
application to carry abstract instructions which are then
compiled on the fly. If you're okay with creating kernel code
on the fly, this can be implemented as a library, beyond any
reasonable doubt.
OpenCL isn't just a library; it is a language extension that is
run through a preprocessor which compiles the embedded __KERNEL
and __DEVICE functions into usable code, and then outputs
.c/.cpp files for the c compiler to deal with.
If D were to implement gpgpu features, I would actually
suggest starting by simply adding a microthreading function
syntax, for example...

void example( aggregate in float a[] ; key, in float b[], out float c[] ) {
    c[key] = a[key] + b[key];
}
By adding an aggregate keyword to the function, we can infer
the range simply from the length of a[] without adding an
extra set of brackets or something similar.
This would make access to the gpu more generic, and more
importantly, because llvm will support HSA, it removes the need
for writing the more complex support into dmd that openCL and
CUDA would require; a few hints for the llvm backend would be
enough to generate the dual-bytecode ELF executables.
1) If you wanted to have that 'key' nonsense in there, I'm
thinking you'd need to add several additional parameters:
global size, group size, group count, and maybe group-local
memory access (requires allowing multiple aggregates?). I mean,
I get the gist of what you're saying, this isn't me pointing
out a problem, just trying to get a clarification on it (maybe
give 'key' some additional structure, or something).
Those are all platform specific; they change based on the whim
and fancy of NVIDIA and AMD with each and every new chip
released: the size and configuration of CUDA clusters, or compute
clusters, or EUs, or whatever the hell chip maker X feels like
using at the moment.
Long term this will all be managed by the underlying support
software in the video drivers, and operating system kernel.
Putting any effort into this is a waste of time.
2) ... I kind of like this idea. I disagree with how you led up
to it, but I like the idea.
3) How do you envision *calling* microthreaded code? Just the
usual syntax?
void example( aggregate in float a[] ; key, in float b[], out float c[] ) {
    c[key] = a[key] + b[key];
}

example(a, b, c);
In the function declaration you can think of the aggregate as
basically having the reverse order of the items in a foreach
statement:

int a[100] = [ ... ];
int b[100];
foreach( k, v ; a ) { b[k] = a[k]; }
float a[100] = [ ... ];
float b[100];

void example2( aggregate in float A[] ; k, out float B[] ) {
    B[k] = A[k];
}

example2(a, b);
4) How would this handle working on subranges?
ex. Let's say I'm coding up a radix sort using something like
this:
https://sites.google.com/site/duanemerrill/PplGpuSortingPreprint.pdf?attredirects=0
What's the high-level program organization with this syntax if
we can only use one range at a time? How many work-items get
fired off? What's the gpu-code launch procedure?
I am pretty sure they are simply multiplying the index value by
the unit size they desire to work on:

float a[100] = [ ... ];
float b[100];

void example3( aggregate in range r ; k, in float a[], out float b[] ) {
    b[k*2]   = a[k*2];
    b[k*2+1] = a[k*2+1];
}

example3( 0 .. 50, a, b );
My guess is that they are then simply executing multiple
__KERNEL functions in sequence.