Hi guys,
I have several tests that contain copied-and-pasted code like this:
static def calc(
        p : Place, n : Int,
        params : RemoteArray[Float]{home==p, rank==1},
        result : RemoteArray[Float]{home==p, rank==1}) {
    val blocks = p.isCUDA() ? 480 : 1;
    val threads = 512;
    finish async at (p) @CUDA @CUDADirectParams {   // kernel body, runs on the GPU place p
        finish
        for ([block] in 0..blocks-1) async {        // one async per CUDA block
            clocked finish
            for ([thread] in 0..threads-1) clocked async {   // one clocked async per CUDA thread
                val tid = block * threads + thread;
                val tids = blocks * threads;
                // grid-stride loop: each thread handles indices tid, tid+tids, tid+2*tids, ...
                for (var i:int = tid; i < n; i += tids) {
                    val d = params(i);
                    result(i) = d * d;
                }
            }
        }
    }
}
... which works fine.
One of the tests calls the "calc" function above, more or less like this:
finish {
    for (gpu in gpus.values()) async at (cpu) {
        ...
        //--- First step : allocate device arrays
        val gpuDatum = CUDAUtilities.makeRemoteArray[Float]
            (gpu, len, (j:int) => cpuDatum(size/n * i + j));
        val gpuResult = CUDAUtilities.makeRemoteArray[Float]
            (gpu, len, (j:int) => 0.0 as Float);
        //--- Second step : call kernel function
        calc(gpu, len, gpuDatum, gpuResult);
        ...
    }
}
This example "works" but does not offer any coordination between gpus
connected to "cpu" (=here).
Once I still have only one gpu at the moment, I defined
export X10RT_ACCELS=CUDA0,CUDA0,CUDA0,CUDA0
When I tried to create another example employing Team the way KMeansCUDA
does, it got stuck, because all the GPUs share the same parent place (= here).
So it looks like (I guess) Team is good for coordination between
different CPU places, since that code is typically host code and not kernel code.
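To make that concrete, what I have in mind is a host-side sketch like the
one below. This is only how I imagine the x10.util.Team API is meant to be
used, modelled on KMeansCUDA; Team.WORLD, barrier(role) and Place.places()
are my assumptions here, not code I actually have running:

import x10.util.Team;

// Host-side coordination between CPU places: every place meets at a
// barrier around its per-place work. The barrier is executed by host
// code only, never inside an @CUDA kernel.
finish for (p in Place.places()) async at (p) {
    Team.WORLD.barrier(here.id);   // wait until every CPU place gets here
    // ... per-place host work, e.g. launching kernels on local GPUs ...
    Team.WORLD.barrier(here.id);   // meet again before continuing
}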
OK. So then I tried coordination using clocks, in a couple of different ways.
In the example below I explicitly declare a clock and use it to coordinate
the per-GPU tasks:
finish async {
    val c = Clock.make();
    for (gpu in gpus.values()) async clocked (c) {
        val i = (gpu==cpu) ? cpu.id : gpu.id - Place.MAX_PLACES;
        val len = size/n + ( i+1==n ? size%n : 0 );
        //--- First step : allocate device arrays
        c.next();
        val gpuDatum = ...
        val gpuResult = ...
        //--- Second step : call kernel function
        c.next();
        calc(gpu, len, gpuDatum, gpuResult);
        ...
    }
}
Executing this example, I get the following message:
X10RT: async 37 is not a CUDA kernel.
If I'm not wrong, this message comes from the kernel function, since it
disappears when I comment out the call to the kernel function.
So it looks like there is some relationship between the "finish" in
the host code and the "finish" in the kernel code.
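For clarity, the nesting I have in mind is roughly this (schematic, with
the details elided):

finish {                                   // host-side finish (or clocked finish)
    for (gpu in gpus.values()) async {     // one host activity per GPU place
        // ... allocation, clock coordination ...
        calc(gpu, len, gpuDatum, gpuResult);   // calc() itself contains
    }                                          //   finish async at (gpu) @CUDA { ... }
}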
In the documentation on Finish (pg 160) it says that "finish S" waits
for the termination of all activities spawned by S. I'm certainly confused
by the implications of this statement, so I tried to simplify the code
above, like this:
clocked finish {
    for (gpu in gpus.values()) async {
        val i = (gpu==cpu) ? cpu.id : gpu.id - Place.MAX_PLACES;
        val len = size/n + ( i+1==n ? size%n : 0 );
        //--- First step : allocate device arrays
        next;
        val gpuDatum = ...
        val gpuResult = ...
        //--- Second step : call kernel function
        next;
        calc(gpu, len, gpuDatum, gpuResult);
        ...
    }
}
When I execute it, the result is exactly the same:
X10RT: async 37 is not a CUDA kernel.
So, could you guys give me some guidance here?
1. Am I correct in thinking that I cannot employ Team when 2 or more GPUs
belong to the same place?
2. What is the relationship between a finish in the host code and a
finish in the kernel code? Or should this question perhaps be about
"clock"s instead of "finish"es?
3. Would you recommend an explicit clock in order to avoid conflicts with
the clock in the kernel function? (See the sketch just below for what I mean.)
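By "an explicit clock" I mean roughly the following condensed sketch of the
first clock example above. The point is that c lives only in host code and
is never passed into calc(), whose internal clocked finish uses its own
implicit clock, so (I hope) the two cannot interfere:

finish async {
    val c = Clock.make();
    for (gpu in gpus.values()) async clocked (c) {
        // ... host-side allocation of gpuDatum / gpuResult ...
        c.next();                              // host-side coordination only
        calc(gpu, len, gpuDatum, gpuResult);   // kernel keeps its own clocked finish
    }
}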
Thanks a lot :)
--
Richard Gomes
M: +44(77)9955-6813
http://tinyurl.com/frgomes
twitter: frgomes
JQuantLib is a library for Quantitative Finance written in Java.
http://www.jquantlib.org/
twitter: jquantlib