Hi Philippe, thanks for the updated results.

> Something else I forgot to mention. GEMM is not the only routine on the
> Earth ;) It may very well be the case that an optimized FFT would
> require a different padding size. It gets tricky. We might have to run
> the auto-tuner for a lot of different padding sizes...

Reasonable padding sizes are multiples of 16 or 32, so this shouldn't
add too many additional candidates. Either way, the search space
grows...

> So, it seems like there is between 10 and 15% penalty (and much more
> sometimes on AMD hardware) happening from not choosing the correct
> padding size. On hawaii, this means that one will obtain (wowow,
> exclusive report on ViennaCL 1.6's performance on Hawaii!) 3.8
> TFLOP/s instead of 4.2 TFLOP/s, and I think that this difference is
> significant enough to be worth being dealt with.
>
> I'm not sure, however, how we should deal with this issue. Since
> kernels are compiled at the context level and since we plan to use
> one device per context, what would you think about handling the
> matrix-padding size within the context instead of the matrix?

Since the padding size is a property of the matrix, and a matrix can in
principle also wrap user-provided memory buffers, this should not be
handled in the context. However, the matrix objects may ask their
context (or some other oracle) for the best padding size whenever a
memory (re)allocation is needed. Since the exact padding is not known
until runtime, we will always have to dispatch for the best kernel
whenever an operation is invoked. Most of these overheads can be
effectively eliminated by setting up references to the program holding
the correctly padded kernels when the matrix object is created or its
memory layout changes.

> I think we shouldn't expose it to the user, though, since the
> kernels have to be entirely compatible with the padding size (and we
> don't want the user to break everything!).
> What would you think about, at context initialization, querying the
> optimal padding size to the generator for the current device? If the
> context has multiple devices with incompatible padding size, how to
> handle it? Display a warning for low performance and use a crappy
> fallback kernel?

We just need a slightly more fine-grained way of compiling OpenCL
programs. Right now we compile the same sources for all devices within
a context. With a powerful generator in place, we simply use the
ability to compile programs per device (see parameter 'device_list' in
clBuildProgram:
http://www.khronos.org/registry/cl/sdk/1.0/docs/man/xhtml/clBuildProgram.html
) and we're good. :-)

Best regards,
Karli
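
For illustration only, here is a minimal sketch of the "matrix asks an
oracle for the best padding" idea discussed above. It is not ViennaCL
code: the names padding_oracle, matrix_layout and make_layout, the
device-name keys, and the fallback value are all hypothetical, and a
real oracle would presumably be fed by the auto-tuner rather than
filled in by hand.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>

// Hypothetical oracle: maps a device name to its preferred padding
// (in elements). Entries are placeholders, not tuned results.
class padding_oracle {
public:
  explicit padding_oracle(std::size_t fallback) : fallback_(fallback) {
    assert(fallback > 0);
  }

  // Register the preferred padding for one device.
  void set(const std::string& device, std::size_t padding) {
    assert(padding > 0);
    table_[device] = padding;
  }

  // Return the registered padding, or the fallback for unknown devices.
  std::size_t best_padding(const std::string& device) const {
    std::map<std::string, std::size_t>::const_iterator it = table_.find(device);
    return it != table_.end() ? it->second : fallback_;
  }

private:
  std::size_t fallback_;
  std::map<std::string, std::size_t> table_;
};

// Round n up to the next multiple of padding.
inline std::size_t padded_size(std::size_t n, std::size_t padding) {
  assert(padding > 0);
  return ((n + padding - 1) / padding) * padding;
}

// Hypothetical layout record: logical sizes plus padded internal sizes.
// The padding is stored with the matrix, not with the context.
struct matrix_layout {
  std::size_t rows, cols;
  std::size_t internal_rows, internal_cols;
  std::size_t padding;
};

// Called on (re)allocation: ask the oracle once, then keep the result.
inline matrix_layout make_layout(std::size_t rows, std::size_t cols,
                                 const padding_oracle& oracle,
                                 const std::string& device) {
  const std::size_t p = oracle.best_padding(device);
  matrix_layout layout = {rows, cols, padded_size(rows, p), padded_size(cols, p), p};
  return layout;
}
```

Storing the chosen padding in the layout is what would let a matrix
pick the matching per-device program once, at creation or reallocation
time, instead of on every operation. The per-device build via
'device_list' is not shown here since it needs an OpenCL runtime.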
