Hi,

> For a few days, I've been playing around with AMD's CodeXL, the HD5850
> and the generator/autotuner:
>
>
> - First of all, I want to share something that drove me completely
> crazy. Avoid:
>   vector += scalar*vector
> in a compute bound context. After replacing the above by:
>   vector.s0 += scalar*vector.s0
>   vector.s1 += scalar*vector.s1
>   vector.s2 += scalar*vector.s2
>   vector.s3 += scalar*vector.s3
> performance jumped from 900GFLOP/s to 2.3TFLOP/s on the HD7970 (which is
> of the same order of magnitude as the best existing kernel so far,
> presented by Matsumoto et al.). Only ~10% improvement on HD5850, though.
> It seems like the AMD OpenCL compiler does not properly translate the
> first operation. A more optimistic view is that it does a very good job
> at translating the second one :)

Oops - no praise for AMD's compiler writers ;-) Do you have the latest 
APP SDK and the latest driver version? If so, I suggest you post this on 
the AMD dev forum. I can't think of any reason why the 'unrolled' 
version should perform better than the vectorized one.


> - I can make my HD5850 peak at ~920GFLOP/s, which is around 45% of the
> theoretical peak. Some people in the literature managed to get 75% of
> an HD5870 (they reach ~2TFLOP/s out of ~2.8TFLOP/s), which is truly
> impressive. They had to use some assembly-like language, though. This is
> because the HD5xxx series uses VLIW bundles of 5 instructions, not 4.
> CodeXL shows "ALUPacking = 80.59%", which is in my opinion a direct
> consequence of packing instructions 4 by 4 instead of 5 by 5. It seems
> to me that this problem has more to do with the OpenCL compiler than
> with my code. Since the autotuner can find a spot with a 95% cache hit
> rate and 0% local memory conflicts, I assume that the problem in the
> kernel comes from the way the ALUs are used, rather than from
> bandwidth issues.
> Does anybody know if other architectures use unusual VLIW widths? AMD
> KernelAnalyzer gives the IL/ISA output for the HD5850, but I am not
> experienced enough to make sense of it.

They switched from a VLIW architecture to their GCN architecture with 
the HD7xxx series:
http://en.wikipedia.org/wiki/Comparison_of_AMD_graphics_processing_units

The HD7970 is thus one of the first GPUs using their GCN architecture; 
any older AMD GPUs should behave more like the HD5850 you have.



> - Very weird behavior: the initial kernel for C = A*B was something like:
>
> __kernel void gemm(uint M, uint K, uint N, __global float4* A,
>                    __global float4* B, __global float4* C) {
>   uint Mv = M/4; // internal size of A, which is row-major
>   uint Nv = N/4; // same for B
>   // ...
>   // use Mv and Nv to compute addresses of A and B, rather than M and N
> }
>
> When replacing it by
>
> __kernel void gemm(uint M, uint K, uint N, __global float4* A,
>                    __global float4* B, __global float4* C) {
>   // use inline M/4 and N/4 to compute addresses of A and B,
>   // rather than M and N
> }
>
> I got ~10% performance improvement on HD5850, no modification on HD7970.
> Don't ask me why.

I assume there is only a little compiler optimization going on here. 
Named variables are likely to be assigned fixed registers, so if you 
keep M/4 and N/4 inline, you save two registers.

> I actually think registers are a very precious
> resource on AMD devices.

Indeed. Newer architectures have more registers, so the benefit of 
carefully managing registers is less pronounced.

> Since the computation of M/4 and N/4 occurs
> pretty rarely, it seems to me that saving a register is usually the
> better choice in these cases. Furthermore, since all these registers
> are probably taken from the vector register pool, it may be that a
> uint occupies a whole 128-bit-wide register. I am not sure, though.

Sounds reasonable.


> for(unsigned int bs = 0 ; bs < 32 ; ++bs) { ... }
>
> seems not to be unrolled by default. Adding #pragma unroll 32
> improves performance on NVIDIA hardware (almost doubling it), but kills
> it by a factor of 10 on AMD hardware, in the GEMM case at least. I
> am confused about this. More on it later if I find an answer to that
> mystery. If not, I'll just enable full #pragma unroll by default, and
> disable it on AMD hardware.

My experience is the same: for-loops are usually not unrolled by default.

I think the AMD hardware is simply running short of registers, or 
cannot fetch the new instructions quickly enough (i.e. the for-loop is 
more suitable for the instruction cache). Maybe it can benefit from a 
partial unroll?


> ===== ON THE AUTOTUNING PROCEDURE =====
> =======================================
>
> Well...
>
> While OpenCL is guaranteed to be mostly thread-safe (except for
> clSetKernelArg, which is thread-safe only as long as arguments are set
> for different kernel objects in parallel), it seems like parallel
> compilations are processed serially. I observed this behavior when
> compiling multiple programs in the same context, but someone else
> observed it across different contexts as well:
> http://stackoverflow.com/questions/14544802/threading-opencl-compiling
> Since compilation is a bottleneck of the autotuner (when the matrix
> size is 1024*1024, at least... see more later), it seems to me that
> parallelizing it would be worthwhile.

Have you tried to pack multiple kernels into the same program object? 
This is usually much more efficient than compiling each kernel 
separately. If you can pack ~4 kernels into the same OpenCL program, the 
compilation times may already be lower than the execution times.


> In the end, I thought the simplest
> way to handle the problem is to partition the search space, and pass a
> partition index as an argument. That way, for a 4-way partitioning:
> ./blas3_tuning 0&
> ./blas3_tuning 1&
> ./blas3_tuning 2&
> ./blas3_tuning 3&
> We may observe some speedup (the Stack Overflow link above reports
> that using fork() resolves the issue).

I don't think that an autotuning process should use multiple processes. 
Some time ago I tried to run our test suite in parallel, resulting in a 
2-3x slow-down with the CUDA 4.2 SDK on a GTX 285. You might get all 
sorts of weird interference biasing the timings, for example:

  Process 1: Submits 10 runs of a fast kernel A, but is interrupted 
after the fifth submission.
  Process 2: Submits 10 runs of a slow kernel B, no interruption.
  Process 1: Completes submission of the remaining 5 runs.

The time you would measure on process 1 is then the execution time for 
the ten runs of the fast kernel A, plus 10 spurious runs of a slow kernel B.


> There are still quite a few things I need to do before talking
> about the autotuning procedure itself :)

My parallel work on the scheduler, support for integer types, and the 
shared BLAS-like library is progressing, so we may approach a release 
during next week. It's important to have the generator stable by that 
time; we can always improve performance later. Do you think this is 
feasible?

Best regards,
Karli


_______________________________________________
ViennaCL-devel mailing list
ViennaCL-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/viennacl-devel
