On Sunday, 18 August 2013 at 05:05:48 UTC, Atash wrote:
On Sunday, 18 August 2013 at 03:55:58 UTC, luminousone wrote:
You do have limited Atomics, but you don't really have any sort of complex messages, or anything like that.

I said 'point 11', not 'point 10'. You also dodged points 1 and 3...

Intel doesn't have a dog in this race, so there is no way to know what they plan on doing, if anything at all.

http://software.intel.com/en-us/vcsource/tools/opencl-sdk
http://www.intel.com/content/www/us/en/processors/xeon/xeon-phi-detail.html

Just based on those, I'm pretty certain they 'have a dog in this race'. The dog happens to be running with MPI and OpenCL across a bridge made of PCIe.

The Xeon Phi is interesting insofar as it takes generic programming to a more parallel environment. However, it has some serious limitations that will heavily damage its potential performance.

AVX2 is completely the wrong path for improving performance in parallel computing. The SIMD nature of the instruction set means that scalar operations, or simply failing to fill the giant 256/512-bit registers, waste huge chunks of the chip's peak theoretical performance; and if any instruction-pairing rules apply to this multi-issue pipeline, you have yet more potential for wasted cycles.
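
To make the lane-waste point concrete, here is a small sketch using core.simd's 128-bit float4 as a stand-in (compiles with DMD/LDC on x86-64; the same logic only gets worse with 256/512-bit registers):

import core.simd;
import std.stdio;

void main()
{
    // Vector path: one add fills all four lanes of the 128-bit register.
    float4 va = [1.0f, 2.0f, 3.0f, 4.0f];
    float4 vb = [8.0f, 7.0f, 6.0f, 5.0f];
    float4 vc = va + vb;
    writeln(vc.array);

    // Scalar path: the same work issued one lane at a time, so 3 of 4
    // lanes (7 of 8 with AVX, 15 of 16 with AVX-512) sit idle per add.
    float[4] sa = [1.0f, 2.0f, 3.0f, 4.0f];
    float[4] sb = [8.0f, 7.0f, 6.0f, 5.0f];
    float[4] sc;
    foreach (i; 0 .. 4)
        sc[i] = sa[i] + sb[i];
    writeln(sc);
}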

I haven't seen anything about Intel's micro-thread scheduler, or how these chips handle the mass context switching natural to micro-threaded environments. These two items make a huge difference in performance; comparing Radeon VLIW5/4 to Radeon GCN is a good example, as most of GCN's performance benefit comes from the ease of scheduling scalar pipelines over more complex pipes with instruction-pairing rules and the like.

Frankly, Intel has some cool stuff, but they have been caught with their pants down; they have depended on their large fab advantage to carry them and gotten lazy.

We likely are watching AMD64 all over again.

The reason to point out HSA is that it is really easy to add support for; it is not a giant task like OpenCL would be. A few changes to the front-end compiler are all that is needed, and LLVM's back end does the rest.

H'okay. I can accept that.

OpenCL isn't just a library; it is a language extension that is run through a preprocessor, which compiles the embedded __KERNEL and __DEVICE functions into usable code and then outputs .c/.cpp files for the C compiler to deal with.

But all those extra bits are part of the computing *environment*. Is there something wrong with requiring the proper environment for an executable?

A more objective question: which devices are you trying to target here?

At first, simply a different way of approaching std.parallelism-like functionality, with an eye toward GPGPU in the future when easy integration solutions pop up (such as HSA).
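
For comparison, this is roughly what the std.parallelism route looks like today for the kind of element-wise work in the examples further down; note that the worker count handed to TaskPool is only a hint, much like OpenMP's thread settings:

import std.parallelism;
import std.range : iota;
import std.stdio;

void main()
{
    float[100] a = 1.0f;
    float[100] b = 2.0f;
    float[100] c;

    auto pool = new TaskPool(4);      // request 4 worker threads (a hint)
    scope (exit) pool.finish();

    // Each index is handed to a worker; the library decides the batching.
    foreach (k; pool.parallel(iota(100)))
        c[k] = a[k] + b[k];

    writeln(c[0], " ", c[99]);        // both 3
}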

Those are all platform specific; they change based on the whim and fancy of NVIDIA and AMD with each and every new chip released: the size and configuration of CUDA clusters, or compute clusters, or EUs, or whatever the hell chip maker X feels like using at the moment.

Long term this will all be managed by the underlying support software in the video drivers, and operating system kernel. Putting any effort into this is a waste of time.

Yes. And the only way to optimize around them is to *know them*, otherwise you're pinning the developer down the same way OpenMP does. Actually, even worse than the way OpenMP does - at least OpenMP lets you set some hints about how many threads you want.

It would be best to wait for a more generic software platform and find out how this is handled by the next generation of micro-threading tools.

The way OpenCL/CUDA work reminds me too much of someone setting up Tomcat to have Java code generate PHP that runs on their Apache server, just because they can. I would rather have tighter integration with the core language than a language within a language.

void example( aggregate in float a[] ; key, in float b[], out float c[] ) {
    c[key] = a[key] + b[key];
}

example(a,b,c);

In the function declaration, you can think of the aggregate as basically having the reverse order of the items in a foreach statement.

int a[100] = [ ... ];
int b[100];
foreach( k, v ; a ) { b[k] = v; }

float a[100] = [ ... ];
float b[100];

void example2( aggregate in float A[] ; k, out float B[] ) { B[k] = A[k]; }

example2(a,b);

Contextually solid. Read my response to the next bit.

I am pretty sure they are simply multiplying the index value by the unit size they desire to work on:

float a[100] = [ ... ];
float b[100];
void example3( aggregate in range r ; k, in float a[], out float b[] ) {
    b[k*2]   = a[k*2];
    b[k*2+1] = a[k*2+1];
}

example3( 0 .. 50, a, b );

My guess would be that they are simply executing multiple __KERNEL functions in sequence.

I've implemented this algorithm before in OpenCL already, and what you're saying so far doesn't rhyme with what's needed.

There are at least two ranges, one keeping track of partial summations, the other holding the partial sorts. Three separate kernels are run in cycles to reduce over and scatter the data. The way early exit is implemented isn't mentioned as part of the implementation details, but my implementation of the strategy requires a third range to act as a flag array to be reduced over and read in between kernel invocations.
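
To give a rough idea of the shape of that, here is a plain CPU-side sketch of the control flow (the three kernel stand-ins are placeholders made up for illustration, not the paper's actual kernels; in OpenCL they would be clEnqueueNDRangeKernel launches):

import std.algorithm : any;
import std.stdio;

void main()
{
    float[256] partialSums;      // range tracking partial summations
    float[256] partialSort;      // range holding the partial sorts
    bool[64]   flags = true;     // per-group "still working" flags

    // Placeholder kernel launches.
    void kernelReduce()  { /* ... reduce over partialSums ... */ }
    void kernelScatter() { /* ... scatter into partialSort ... */ }
    void kernelFlags(ref bool[64] f) { f[] = false; /* pretend we converged */ }

    size_t cycles;
    do
    {
        kernelReduce();
        kernelScatter();
        kernelFlags(flags);
        ++cycles;
        // Read the flag range back between invocations and exit early
        // once no work-group reports outstanding work.
    } while (flags[].any!(f => f) && cycles < 1000);

    writeln("finished after ", cycles, " cycle(s)");
}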

It isn't just unit size multiplication - there's communication between work-items and *exquisitely* arranged local-group reductions and scans (so-called 'warpscans') that take advantage of the widely accepted concept of a local group of work-items (a parameter you explicitly disregarded) and their shared memory pool. The entire point of the paper is that it's possible to come up with a general algorithm that can be parameterized to fit individual GPU configurations if desired. This kind of algorithm provides opportunities for tuning... which seem to be lost, unnecessarily, in what I've read so far in your descriptions.
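
For anyone who hasn't seen one, a local-group scan is roughly the following, written here as a plain CPU sketch; on the GPU the buffer lives in the group's shared memory, the inner foreach is the group's work-items running concurrently, and each doubling step ends with a barrier rather than the array copy used below:

import std.stdio;

enum groupSize = 8;

// Hillis-Steele inclusive scan over one work-group's buffer.
void groupScan(ref float[groupSize] local)
{
    for (int offset = 1; offset < groupSize; offset *= 2)
    {
        float[groupSize] snapshot = local;   // stands in for the barrier
        foreach (i; 0 .. groupSize)          // the group's "work-items"
            if (i >= offset)
                local[i] = snapshot[i] + snapshot[i - offset];
    }
}

void main()
{
    float[groupSize] data = [1, 1, 1, 1, 1, 1, 1, 1];
    groupScan(data);
    writeln(data);   // [1, 2, 3, 4, 5, 6, 7, 8]: running sums
}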

My point being, I don't like where this is going by treating coprocessors, which have so far been very *very* different from one another, as the same batch of *whatever*. I also don't like that it's ignoring NVidia, and ignoring Intel's push for general-purpose accelerators such as their Xeon Phi.

But, meh, if HSA is so easy, then it's low-hanging fruit, so whatever, go ahead and push for it.

=== REMINDER OF RELEVANT STUFF FURTHER UP IN THE POST:

"A more objective question: which devices are you trying to target here?"

=== AND SOMETHING ELSE:

I feel like we're just on different wavelengths. At what level do you imagine having this support, in terms of support for doing low-level things? Is this something like OpenMP, where threading and such are done at a really (really really really...) high level, or what?

Low-level optimization is a wonderful thing, but I almost wonder if this will always be something where, in order to do the low-level optimization, you will be using the vendor's provided platform for doing it, as no generic tool will be able to match the custom one.

Most of my interaction with the GPU is via shader programs for OpenGL. I have only lightly used CUDA for some image-processing software, so I am certainly not the one to give in-depth detail on optimization strategies.

Sorry, on point 1 that was a typo; I meant:

1. The range must be known prior to execution of a gpu code block.

as for

3. Code blocks can only receive a single range; it can, however, be multidimensional.

float a[100] = [ ... ];
float b[100];
void example3( aggregate in range r ; k, in float a[], out float b[] ) {
    b[k] = a[k];
}
example3( 0 .. 100, a, b );

This function would be executed 100 times.

float a[10_000] = [ ... ];
float b[10_000];
void example3( aggregate in range r ; kx, aggregate in range r2 ; ky, in float a[], out float b[] ) {
    b[kx+(ky*100)] = a[kx+(ky*100)];
}
example3( 0 .. 100, 0 .. 100, a, b );

This function would be executed 10,000 times, with the two aggregate ranges being treated as a single two-dimensional range.

Maybe a better description of the rule would be that multiple ranges are multiplicative, and functionally operate as a single range.
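
In current D terms, the closest analogue to that multiplicative rule is a flattened parallel foreach over a single combined index space; a minimal sketch (the aggregate syntax above is, of course, still hypothetical):

import std.parallelism : parallel;
import std.range : iota;
import std.stdio;

void main()
{
    auto a = new float[10_000];
    auto b = new float[10_000];
    a[] = 1.0f;

    // Two 0 .. 100 ranges behave like one flattened 0 .. 10_000 range.
    foreach (i; parallel(iota(10_000)))
    {
        immutable kx = i % 100;
        immutable ky = i / 100;
        b[kx + (ky * 100)] = a[kx + (ky * 100)];
    }

    writeln(b[0], " ", b[9_999]);   // both 1
}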
