Hi Phil, please don't drop the devel mailing list unless you mention some French secrets (maybe cheese or wine recipes?) which you are not allowed to share in public ;-)
>> They switched from a VLIW architecture to their GCN architecture within
>> the HD7xxx series:
>> http://en.wikipedia.org/wiki/Comparison_of_AMD_graphics_processing_units
>>
>> The HD7970 is thus one of the first using their GCN architecture; any
>> older AMD GPUs should behave more like the HD5850 you have.
>
> Yes, I know... Actually, I think VLIW4 would work well, but having large
> packs of 5 instructions makes it difficult, I think, to leverage its
> power using OpenCL. But well, I may be mistaken.

The subtle point I was trying to make is that VLIW4 is on its way to
becoming a hardware dinosaur, so we can safely accept that OpenCL is
presumably not the best programming model for it.

>>> I actually think registers are a very precious
>>> resource on AMD devices.
>>
>> Indeed. Newer architectures have more registers, so the benefit of
>> carefully managing registers is less pronounced.
>
> Well, I think we'll always have to be careful about register usage.
> Newer devices get more registers, but also more ALUs and more Processing
> Elements. I feel like some forthcoming OpenCL devices (FPGAs, DSPs,
> embedded GPUs) may be equally sensitive to register pressure, though.

Good point: with synthesized hardware, every register/transistor counts.
Too bad most of them use only a single DRAM channel and are terribly
expensive...

>> I think that the AMD hardware is simply running short of registers or
>> cannot fetch the new instructions (i.e. the for-loop is more suitable
>> for an instruction cache). Maybe it can benefit from a partial unroll?
>
> Unfortunately, I've tried #pragma unroll 2, 4, 8, 16, 32, but in each
> single case it seems to harm performance! I should also try to figure
> out a minimal kernel that reproduces the issue and post it on the
> AMD dev forum.

Does #pragma unroll 2 show the same behavior as manually putting two
statements into the for loop? In theory it should, but...
;-)

>> Have you tried to pack multiple kernels into the same program object?
>> This is usually much more efficient than compiling each kernel
>> separately. If you can pack ~4 kernels into the same OpenCL program, the
>> compilation times may already be lower than the execution times.
>
> It is unfortunately impossible to do with the current API, since I
> cannot have multiple GEMM profiles assigned to the same kernel_generator
> object...

Hmm, let's discuss this tomorrow. I'll also try to dig out my
microbenchmark on compilation times so that we can better reason about
potential savings.

>>> There are still quite a few things I need to do before talking
>>> about the autotuning procedure itself :)
>>
>> My parallel work on the scheduler, support for integer types, and the
>> shared BLAS-like library is progressing, so we may approach a release
>> next week. It's important to have the generator stable by that time;
>> we can always improve performance later. Do you think this is
>> feasible?
>
> I hope so. I have no intention of optimizing the templates further for
> now. I still have some bugs as of now, but we can talk about this in
> more detail tomorrow on IRC.

These templates are awesome already :-) With some stress tests the
remaining bugs should be eliminated soon. Let me know if you need
additional machines to run on.

Best regards,
Karli