Hey everybody,

For a few days, I've been playing around with AMD's CodeXL, the HD5850 and
the generator/autotuner:


- First of all, I want to share something that drove me completely crazy.
Avoid:

    vector += scalar*vector;

in a compute-bound context. After replacing the above with:

    vector.s0 += scalar*vector.s0;
    vector.s1 += scalar*vector.s1;
    vector.s2 += scalar*vector.s2;
    vector.s3 += scalar*vector.s3;

performance jumped from 900 GFLOP/s to 2.3 TFLOP/s on the HD7970 (which is
of the same order of magnitude as the best existing kernel so far,
presented by Matsumoto et al.). Only ~10% improvement on the HD5850, though.
It seems like the AMD OpenCL compiler does not properly translate the first
operation. A more optimistic view is that it does a very good job at
translating the second one :)
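
For reference, here is roughly where this sits in the micro-kernel. This is
only a toy rank-1-update sketch, not the actual generated code; acc, As and
Bs are made-up names:

    float4 acc = (float4)(0.0f);      //accumulator for one output block
    for(uint k = 0; k < K; ++k){
        float  a = As[k];             //one element of the A panel
        float4 b = Bs[k];             //one row fragment of the B panel
        //slow with the AMD compiler:  acc += a*b;
        //fast, same math written out component-wise:
        acc.s0 += a*b.s0;
        acc.s1 += a*b.s1;
        acc.s2 += a*b.s2;
        acc.s3 += a*b.s3;
    }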

---------------------------------

- I can make my HD5850 peak at ~920 GFLOP/s, which is around 45% of the
theoretical peak. Some people in the literature managed to get 75% out of an
HD5870 (they reach ~2 TFLOP/s out of ~2.8 TFLOP/s), which is truly
impressive. They had to use some assembly-like language, though. This is
because the HD5xxx series uses VLIW bundles of 5 instructions, not 4. CodeXL
shows "ALUPacking = 80.59%", which is in my opinion a direct consequence of
packing instructions 4 by 4 instead of 5 by 5 (4/5 = 80%). It seems to me
that this problem has more to do with the OpenCL compiler than with my code.
Since the autotuner can find a spot at 95% cache hit rate and 0% local
memory conflicts, I assume that the problem in the kernel comes from the way
the ALUs are used, rather than from bandwidth issues.
Does anybody know if some other architectures use fancy VLIW lengths? AMD's
KernelAnalyzer gives the IL/ISA output for the HD5850, but I am not
experienced enough to make anything of it.
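
If the 4-by-4 packing theory is right, one thing that might be worth trying
is to hand the compiler 5 independent multiply-adds per iteration instead of
one float4 operation, so that each VLIW-5 bundle can be filled. This is pure
speculation on my part, and all names below are made up:

    //speculative sketch: 5 independent scalar accumulators, hoping the
    //compiler packs the 5 MADs of each iteration into one VLIW-5 bundle
    float acc0 = 0.0f, acc1 = 0.0f, acc2 = 0.0f, acc3 = 0.0f, acc4 = 0.0f;
    for(uint k = 0; k < K; ++k){
        acc0 += a[k]*b0[k];
        acc1 += a[k]*b1[k];
        acc2 += a[k]*b2[k];
        acc3 += a[k]*b3[k];
        acc4 += a[k]*b4[k];
    }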

---------------------------------

- Very weird behavior:
the initial kernel for C = A*B was something like:

> __kernel void gemm(uint M, uint K, uint N, __global float4* A,
>                    __global float4* B, __global float4* C){
>   uint Mv = M/4; //internal size of A, which is row-major
>   uint Nv = N/4; //same thing for B
>   //...
>   //use Mv and Nv to compute addresses into A and B, rather than M and N
> }

When replacing it by

> __kernel void gemm(uint M, uint K, uint N, __global float4* A,
>                    __global float4* B, __global float4* C){
>   //...
>   //use M/4 and N/4 inline to compute addresses into A and B, rather than
>   //precomputing Mv and Nv
> }

I got ~10% performance improvement on the HD5850, and no change on the
HD7970. Don't ask me why. I actually think registers are a very precious
resource on AMD devices. Since the computation of M/4 and N/4 occurs pretty
rarely, it seems to me that recomputing it inline and saving a register is
usually the better choice in such cases. Furthermore, since all these
registers are probably allocated from the vector register pool, it may be
that a uint occupies a whole 128-bit wide register. I am not sure, though.
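
To make "inline" concrete, the difference is roughly the following. This is
a toy address computation, not the generator's actual indexing, and k/col
are made-up names:

    //precomputed: Nv lives in a register for the whole kernel
    uint Nv = N/4;
    float4 b = B[k*Nv + col];

    //inline: recomputed at each use, but no register is held
    float4 b = B[k*(N/4) + col];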

---------------------------------


> for(unsigned int bs = 0 ; bs < 32 ; ++bs){ ... }

does not seem to be unrolled by default. Adding #pragma unroll 32
improves performance on NVidia hardware (it almost doubles), but kills it
by a factor of 10 on AMD hardware, for the GEMM case at least. I am
confused about this. More on this later if I find an answer to that mystery.
If not, I'll just use full #pragma unroll by default, and disable it on
AMD hardware.
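
For completeness, the change is just the pragma in front of the innermost
loop (toy body with made-up names; the real inner block is generated):

    #pragma unroll 32
    for(unsigned int bs = 0 ; bs < 32 ; ++bs){
        acc += As[bs]*Bs[bs];   //toy inner block of the GEMM micro-kernel
    }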


===== ON THE AUTOTUNING PROCEDURE =====
=======================================

Well...

While OpenCL is, I think, guaranteed to be mostly thread-safe (except for
clSetKernelArg, which is thread-safe only as long as we set arguments for
different kernels in parallel), parallel compilations seem to be processed
serially. I observed this behavior when compiling multiple programs in the
same context, but someone else observed it across different contexts as
well:
http://stackoverflow.com/questions/14544802/threading-opencl-compiling .
Since compilation is a bottleneck of the autotuner (when the matrix size is
1024*1024, at least... more on this later), it seems to me that working
around this serialization would be a good thing. In the end, I think the
simplest way to handle the problem is to partition the search space and pass
a partition index as an argument. That way, for a 4-way partitioning:
./blas3_tuning 0 &
./blas3_tuning 1 &
./blas3_tuning 2 &
./blas3_tuning 3 &
We may observe some speed-up (since the Stack Overflow link above reports
that using fork() resolves the issue). Or should we maybe use fork()
internally instead? Does anyone know whether "make -j 4" uses fork or
multi-threading? We could for example have some ./blas3_tuning -j 4; a
rough sketch of what I mean is below.
However, for big matrix sizes, the tuning time seems to be dominated by the
execution of the crappy kernels anyway...
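
Here is a minimal sketch of the fork()-based variant, assuming a
hypothetical tune_partition() that builds and benchmarks one slice of the
search space (none of these names exist in the code yet):

    /* each child process gets its own OpenCL compiler instance,
       so the program builds can actually overlap */
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/types.h>
    #include <sys/wait.h>
    #include <unistd.h>

    static void tune_partition(int p, int n)
    {
      /* placeholder: build and benchmark the kernels whose index i in the
         search space satisfies i % n == p */
      printf("child %d/%d tuning its slice\n", p, n);
    }

    int main(int argc, char** argv)
    {
      int n = (argc > 1) ? atoi(argv[1]) : 4;   /* e.g. ./blas3_tuning 4 */
      int p;
      for(p = 0; p < n; ++p){
        pid_t pid = fork();
        if(pid == 0){                 /* child: tune one slice, then exit */
          tune_partition(p, n);
          _exit(0);
        }
      }
      while(wait(NULL) > 0);          /* parent: wait for all children */
      return 0;
    }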

---------------------------------

There are still quite a few things I need to do before talking about the
autotuning procedure itself :)

Best regards,
Philippe