On Sunday, 18 August 2013 at 05:05:48 UTC, Atash wrote:
On Sunday, 18 August 2013 at 03:55:58 UTC, luminousone wrote:
You do have limited Atomics, but you don't really have any sort of complex messages, or anything like that.

I said 'point 11', not 'point 10'. You also dodged points 1 and 3...

Intel doesn't have a dog in this race, so there is no way to know what they plan on doing, if anything at all.

http://software.intel.com/en-us/vcsource/tools/opencl-sdk
http://www.intel.com/content/www/us/en/processors/xeon/xeon-phi-detail.html

Just based on those, I'm pretty certain they 'have a dog in this race'. The dog happens to be running with MPI and OpenCL across a bridge made of PCIe.

The Xeon Phi is interesting insofar as it takes generic programming to a more parallel environment. However, it has some serious limitations that will heavily damage its potential performance.

AVX2 is completely the wrong path for improving performance in parallel computing. The SIMD nature of the instruction set means that scalar operations, or simply failing to fill the giant 256/512-bit registers, waste huge chunks of the chip's peak theoretical performance; and if any instruction-pairing rules apply to this multi-issue pipeline, you have yet more potential for wasted cycles.
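
To make the lane-waste point concrete, here is a small sketch using core.simd's 128-bit float4 as a stand-in (compiles with DMD/LDC on x86-64; the same logic only gets worse with 256/512-bit registers):

import core.simd;
import std.stdio;

void main()
{
    // Vector path: one add fills all four lanes of the 128-bit register.
    float4 va = [1.0f, 2.0f, 3.0f, 4.0f];
    float4 vb = [8.0f, 7.0f, 6.0f, 5.0f];
    float4 vc = va + vb;
    writeln(vc.array);

    // Scalar path: the same work issued one lane at a time, so 3 of 4
    // lanes (7 of 8 with AVX, 15 of 16 with AVX-512) sit idle per add.
    float[4] sa = [1.0f, 2.0f, 3.0f, 4.0f];
    float[4] sb = [8.0f, 7.0f, 6.0f, 5.0f];
    float[4] sc;
    foreach (i; 0 .. 4)
        sc[i] = sa[i] + sb[i];
    writeln(sc);
}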

I haven't seen anything about Intel's micro-thread scheduler, or how these chips handle the mass context switching natural to micro-threaded environments. These two items make a huge difference in performance; comparing Radeon VLIW5/4 to Radeon GCN is a good example, as most of GCN's performance benefit comes from the ease of scheduling scalar pipelines over more complex pipes with instruction-pairing rules and the like.

Frankly, Intel has some cool stuff, but they have been caught with their pants down; they have depended on their large fab advantage to carry them and gotten lazy.

We likely are watching AMD64 all over again.

The reason to point out HSA is that it is really easy to add support for; it is not a giant task like OpenCL would be. A few changes to the front-end compiler are all that is needed, and LLVM's back end does the rest.

H'okay. I can accept that.

OpenCL isn't just a library; it is a language extension that is run through a preprocessor, which compiles the embedded __KERNEL and __DEVICE functions into usable code and then outputs .c/.cpp files for the C compiler to deal with.

But all those extra bits are part of the computing *environment*. Is there something wrong with requiring the proper environment for an executable?

A more objective question: which devices are you trying to target here?

At first, simply a different way of approaching std.parallelism-like functionality, with an eye toward GPGPU in the future when easy integration solutions pop up (such as HSA).
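
For comparison, this is roughly what the std.parallelism route looks like today for the kind of element-wise work in the examples further down; note that the worker count handed to TaskPool is only a hint, much like OpenMP's thread settings:

import std.parallelism;
import std.range : iota;
import std.stdio;

void main()
{
    float[100] a = 1.0f;
    float[100] b = 2.0f;
    float[100] c;

    auto pool = new TaskPool(4);      // request 4 worker threads (a hint)
    scope (exit) pool.finish();

    // Each index is handed to a worker; the library decides the batching.
    foreach (k; pool.parallel(iota(100)))
        c[k] = a[k] + b[k];

    writeln(c[0], " ", c[99]);        // both 3
}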

Those are all platform specific; they change based on the whim and fancy of NVIDIA and AMD with each and every new chip released: the size and configuration of CUDA clusters, or compute clusters, or EUs, or whatever the hell chip maker X feels like using at the moment.

Long term this will all be managed by the underlying support software in the video drivers, and operating system kernel. Putting any effort into this is a waste of time.

Yes. And the only way to optimize around them is to *know them*, otherwise you're pinning the developer down the same way OpenMP does. Actually, even worse than the way OpenMP does - at least OpenMP lets you set some hints about how many threads you want.

It would be best to wait for a more generic software platform and find out how this is handled by the next generation of micro-threading tools.

The way OpenCL/CUDA work reminds me too much of someone setting up Tomcat to have Java code generate PHP that runs on their Apache server, just because they can. I would rather have tighter integration with the core language than a language within a language.

void example( aggregate in float a[] ; key, in float b[], out float c[] ) {
    c[key] = a[key] + b[key];
}

example(a,b,c);

In the function declaration, you can think of the aggregate as basically having the reverse order of the items in a foreach statement.

int a[100] = [ ... ];
int b[100];
foreach( k, v ; a ) { b[k] = v; }

float a[100] = [ ... ];
float b[100];

void example2( aggregate in float A[] ; k, out float B[] ) { B[k] = A[k]; }

example2(a,b);

Contextually solid. Read my response to the next bit.

I am pretty sure they are simply multiplying the index value by the unit size they desire to work on:

float a[100] = [ ... ];
float b[100];
void example3( aggregate in range r ; k, in float a[], out float b[] ) {
    b[k*2]   = a[k*2];
    b[k*2+1] = a[k*2+1];
}

example3( 0 .. 50, a, b );

My guess would be that they are simply executing multiple __KERNEL functions in sequence.

I've implemented this algorithm before in OpenCL already, and what you're saying so far doesn't rhyme with what's needed.

There are at least two ranges, one keeping track of partial summations, the other holding the partial sorts. Three separate kernels are run in cycles to reduce over and scatter the data. The way early exit is implemented isn't mentioned as part of the implementation details, but my implementation of the strategy requires a third range to act as a flag array to be reduced over and read in between kernel invocations.
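
To give a rough idea of the shape of that, here is a plain CPU-side sketch of the control flow (the three kernel stand-ins are placeholders made up for illustration, not the paper's actual kernels; in OpenCL they would be clEnqueueNDRangeKernel launches):

import std.algorithm : any;
import std.stdio;

void main()
{
    float[256] partialSums;      // range tracking partial summations
    float[256] partialSort;      // range holding the partial sorts
    bool[64]   flags = true;     // per-group "still working" flags

    // Placeholder kernel launches.
    void kernelReduce()  { /* ... reduce over partialSums ... */ }
    void kernelScatter() { /* ... scatter into partialSort ... */ }
    void kernelFlags(ref bool[64] f) { f[] = false; /* pretend we converged */ }

    size_t cycles;
    do
    {
        kernelReduce();
        kernelScatter();
        kernelFlags(flags);
        ++cycles;
        // Read the flag range back between invocations and exit early
        // once no work-group reports outstanding work.
    } while (flags[].any!(f => f) && cycles < 1000);

    writeln("finished after ", cycles, " cycle(s)");
}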

It isn't just unit size multiplication - there's communication between work-items and *exquisitely* arranged local-group reductions and scans (so-called 'warpscans') that take advantage of the widely accepted concept of a local group of work-items (a parameter you explicitly disregarded) and their shared memory pool. The entire point of the paper is that it's possible to come up with a general algorithm that can be parameterized to fit individual GPU configurations if desired. This kind of algorithm provides opportunities for tuning... which seem to be lost, unnecessarily, in what I've read so far in your descriptions.
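
For anyone who hasn't seen one, a local-group scan is roughly the following, written here as a plain CPU sketch; on the GPU the buffer lives in the group's shared memory, the inner foreach is the group's work-items running concurrently, and each doubling step ends with a barrier rather than the array copy used below:

import std.stdio;

enum groupSize = 8;

// Hillis-Steele inclusive scan over one work-group's buffer.
void groupScan(ref float[groupSize] local)
{
    for (int offset = 1; offset < groupSize; offset *= 2)
    {
        float[groupSize] snapshot = local;   // stands in for the barrier
        foreach (i; 0 .. groupSize)          // the group's "work-items"
            if (i >= offset)
                local[i] = snapshot[i] + snapshot[i - offset];
    }
}

void main()
{
    float[groupSize] data = [1, 1, 1, 1, 1, 1, 1, 1];
    groupScan(data);
    writeln(data);   // [1, 2, 3, 4, 5, 6, 7, 8]: running sums
}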

My point being, I don't like where this is going by treating coprocessors, which have so far been very *very* different from one another, as the same batch of *whatever*. I also don't like that it's ignoring NVidia, and ignoring Intel's push for general-purpose accelerators such as their Xeon Phi.

But, meh, if HSA is so easy, then it's low-hanging fruit, so whatever, go ahead and push for it.

=== REMINDER OF RELEVANT STUFF FURTHER UP IN THE POST:

"A more objective question: which devices are you trying to target here?"

=== AND SOMETHING ELSE:

I feel like we're just on different wavelengths. At what level do you imagine having this support, in terms of support for doing low-level things? Is this something like OpenMP, where threading and such are done at a really (really really really...) high level, or what?

Low-level optimization is a wonderful thing, but I almost wonder if this will always be something where, in order to do the low-level optimization, you will be using the vendor's provided platform for doing it, as no generic tool will be able to match the custom one.

Most of my interaction with the GPU is via shader programs for OpenGL. I have only lightly used CUDA for some image-processing software, so I am certainly not the one to give in-depth detail on optimization strategies.

Sorry, on point 1 that was a typo; I meant:

1. The range must be known prior to execution of a gpu code block.

as for

3. Code blocks can only receive a single range; it can, however, be multidimensional.

float a[100] = [ ... ];
float b[100];
void example3( aggregate in range r ; k, in float a[], out float b[] ) {
    b[k] = a[k];
}
example3( 0 .. 100, a, b );

This function would be executed 100 times.

float a[10_000] = [ ... ];
float b[10_000];
void example3( aggregate in range r ; kx, aggregate in range r2 ; ky, in float a[], out float b[] ) {
    b[kx+(ky*100)] = a[kx+(ky*100)];
}
example3( 0 .. 100, 0 .. 100, a, b );

This function would be executed 10,000 times, with the two aggregate ranges being treated as a single two-dimensional range.

Maybe a better description of the rule would be that multiple ranges are multiplicative, and functionally operate as a single range.
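
In current D terms, the closest analogue to that multiplicative rule is a flattened parallel foreach over a single combined index space; a minimal sketch (the aggregate syntax above is, of course, still hypothetical):

import std.parallelism : parallel;
import std.range : iota;
import std.stdio;

void main()
{
    auto a = new float[10_000];
    auto b = new float[10_000];
    a[] = 1.0f;

    // Two 0 .. 100 ranges behave like one flattened 0 .. 10_000 range.
    foreach (i; parallel(iota(10_000)))
    {
        immutable kx = i % 100;
        immutable ky = i / 100;
        b[kx + (ky * 100)] = a[kx + (ky * 100)];
    }

    writeln(b[0], " ", b[9_999]);   // both 1
}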
