Hi Phil, please don't drop the devel mailing list unless you mention some French secrets (maybe cheese or wine recipes?) which you are not allowed to share in public ;-)
>> They switched from a VLIW architecture to their GCN architecture within
>> the HD7xxx series:
>> http://en.wikipedia.org/wiki/Comparison_of_AMD_graphics_processing_units
>>
>> The HD7970 is thus one of the first using their GCN architecture; any
>> older AMD GPUs should behave more like the HD5850 you have.
>
> Yes, I know... Actually, I think VLIW4 would work well, but having large
> packs of 5 instructions makes it difficult, I think, to leverage its
> power using OpenCL. But well, I may be mistaken.

The subtle point I was trying to make is that VLIW4 is on its way to
becoming a hardware dinosaur, so we can safely accept that OpenCL is
presumably not the best programming model for it.

>>> I actually think registers are a very precious
>>> resource on AMD devices.
>>
>> Indeed. Newer architectures have more registers, so the benefit of
>> carefully managing registers is less pronounced.
>
> Well, I think we'll always have to be careful about register usage.
> Newer devices get more registers, but also more ALUs and more Processing
> Elements. I feel like some forthcoming OpenCL devices (FPGAs, DSPs,
> embedded GPUs) may be equally sensitive to register pressure, though.

Good point: with synthesized hardware, every register/transistor counts.
Too bad most of them use only a single DRAM channel and are terribly
expensive...

>> I think that the AMD hardware is simply running short of registers or
>> cannot fetch the new instructions (i.e. the for-loop is more suitable
>> for an instruction cache). Maybe it can benefit from a partial unroll?
>
> Unfortunately, I've tried #pragma unroll 2, 4, 8, 16, 32, but in each
> single case it seems to harm performance! I should also try to figure
> out a minimal kernel that reproduces the issue and post it on the
> AMD dev forum.

Does #pragma unroll 2 show the same behavior as manually putting two
statements into the for loop? In theory it should, but...
;-)

>> Have you tried to pack multiple kernels into the same program object?
>> This is usually much more efficient than compiling each kernel
>> separately. If you can pack ~4 kernels into the same OpenCL program, the
>> compilation times may already be lower than the execution times.
>
> It is unfortunately impossible to do with the current API, since I
> cannot have multiple GEMM profiles assigned to the same kernel_generator
> object...

Hmm, let's discuss this tomorrow. I'll also try to dig out my
microbenchmark on compilation times so that we can better reason about
potential savings.

>>> There are still quite a few things I need to do before talking
>>> about the autotuning procedure itself :)
>>
>> My parallel work on the scheduler, support for integer types, and the
>> shared BLAS-like library is progressing, so we may approach a release
>> next week. It's important to have the generator stable by that time;
>> we can always improve performance later. Do you think this is
>> feasible?
>
> I hope so. I have no intention of optimizing the templates further for
> now. I still have some bugs as of now, but we can talk about this in
> more detail tomorrow on IRC.

These templates are awesome already :-) With some stress tests the
remaining bugs should be eliminated soon. Let me know if you need
additional machines to run on.

Best regards,
Karli