Hi Philippe,

thanks for the updated results.

> Something else I forgot to mention. GEMM is not the only routine on the
> Earth ;) It may very well be the case that an optimized FFT would
> require a different padding size. It gets tricky. We might have to run
> the auto-tuner for a lot of different padding sizes...

Reasonable padding sizes are multiples of 16 or 32, so this shouldn't 
add too many additional candidates. Either way, the search space grows...
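
To make the size of that candidate set concrete, here is a minimal sketch in plain C++. The helper names padded_size and padding_candidates are made up for illustration and are not part of ViennaCL; the upper bound on the padding is also just an assumption.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Round n up to the next multiple of 'padding' (hypothetical helper).
inline std::size_t padded_size(std::size_t n, std::size_t padding)
{
  assert(padding > 0);
  return ((n + padding - 1) / padding) * padding;
}

// Candidate paddings the auto-tuner would have to try: all multiples of
// 'step' up to and including 'max_padding' (both values are assumptions).
inline std::vector<std::size_t> padding_candidates(std::size_t step,
                                                   std::size_t max_padding)
{
  assert(step > 0);
  std::vector<std::size_t> result;
  for (std::size_t p = step; p <= max_padding; p += step)
    result.push_back(p);
  return result;
}
```

With a step of 16 and an assumed upper bound of 128 this gives eight candidates, with a step of 32 only four, so each tuned kernel gets multiplied by a small constant factor.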

>     So, it seems like there is between 10 and 15% penalty (and much more
>     sometimes on AMD hardware) happening from not choosing the correct
>     padding size. On hawaii, this means that one will obtain (wowow,
>     exclusive report on ViennaCL 1.6's performance on Hawaii!) 3.8
>     TFLOP/s instead of 4.2 TFLOP/s, and I think that this difference is
>     significant enough to be worth being dealt with.
>
>     I'm not sure, however, how we should deal with this issue. Since
>     kernels are compiled at the context level and since we plan to use
>     one device per context, what would you think about handling the
>     matrix-padding size within the context instead of the matrix?

The padding size is a property of the matrix, and a matrix can in 
principle also wrap user-provided memory buffers, so it should not be 
handled in the context. However, matrix objects may ask their context 
(or some other oracle) for the best padding size whenever a memory 
(re)allocation is needed. Since the exact padding is not known until 
runtime, we will always have to dispatch to the best kernel whenever an 
operation is invoked. Most of this overhead can be eliminated by 
setting up references to the program holding the correctly padded 
kernels when the matrix object is created or its memory layout changes.
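
A rough sketch of what I have in mind, in plain C++ without any OpenCL calls. All names here (padded_program, padding_oracle, matrix_sketch) are hypothetical stand-ins and not existing ViennaCL classes; the "program" is just a struct carrying a name instead of a real compiled OpenCL program.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>

// Hypothetical stand-in for a program compiled for one padding size.
struct padded_program
{
  std::size_t padding;
  std::string name;
};

// Hypothetical oracle: knows the tuned padding for its device and owns
// one program per padding size that has been requested so far.
class padding_oracle
{
public:
  explicit padding_oracle(std::size_t best_padding) : best_padding_(best_padding)
  {
    assert(best_padding > 0);
  }

  std::size_t best_padding() const { return best_padding_; }

  // The returned reference stays valid: std::map never relocates elements.
  padded_program const & program_for(std::size_t padding)
  {
    auto it = programs_.find(padding);
    if (it == programs_.end())
      it = programs_.emplace(padding,
             padded_program{padding, "gemm_pad" + std::to_string(padding)}).first;
    return it->second;
  }

private:
  std::size_t best_padding_;
  std::map<std::size_t, padded_program> programs_;
};

class matrix_sketch
{
public:
  // Library-owned memory: the matrix asks the oracle for the padding.
  matrix_sketch(std::size_t rows, std::size_t cols, padding_oracle & oracle)
    : oracle_(&oracle), program_(nullptr)
  {
    set_layout(rows, cols, oracle.best_padding());
  }

  // Wrapped user buffer: the padding is dictated by the user's layout.
  matrix_sketch(std::size_t rows, std::size_t cols, std::size_t user_padding,
                padding_oracle & oracle)
    : oracle_(&oracle), program_(nullptr)
  {
    set_layout(rows, cols, user_padding);
  }

  // Reallocation by the library: query the oracle again.
  void resize(std::size_t rows, std::size_t cols)
  {
    set_layout(rows, cols, oracle_->best_padding());
  }

  std::size_t internal_rows() const { return internal_rows_; }
  std::size_t internal_cols() const { return internal_cols_; }

  // Cached at layout time, so operations need no lookup when invoked.
  padded_program const & program() const { return *program_; }

private:
  static std::size_t round_up(std::size_t n, std::size_t p)
  {
    return ((n + p - 1) / p) * p;
  }

  void set_layout(std::size_t rows, std::size_t cols, std::size_t padding)
  {
    assert(padding > 0);
    internal_rows_ = round_up(rows, padding);
    internal_cols_ = round_up(cols, padding);
    program_ = &oracle_->program_for(padding);  // the only lookup
  }

  padding_oracle * oracle_;
  padded_program const * program_;
  std::size_t internal_rows_;
  std::size_t internal_cols_;
};
```

The lookup happens only in set_layout(), i.e. on construction and on resize, so a call to program() during an operation is just a pointer dereference. A matrix wrapping a user buffer keeps the user's padding and simply gets pointed at a different program.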


>     I think we shouldn't expose it to the user, though, since the
>     kernels have to be entirely compatible with the padding size (and we
>     don't want the user to break everything!). What would you think
>     about, at context initialization, querying the optimal padding size
>     to the generator for the current device? If the context has multiple
>     devices with incompatible padding size, how to handle it? Display a
>     warning for low performance and use a crappy fallback kernel?

We just need a slightly more fine-grained way of compiling OpenCL 
programs. Right now we compile the same sources for all devices within a 
context. With a powerful generator in place, we can simply use the 
ability to compile programs per device (see parameter 'device_list' in 
clBuildProgram: 
http://www.khronos.org/registry/cl/sdk/1.0/docs/man/xhtml/clBuildProgram.html 
) and we're good. :-)
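
As a sketch of the per-device idea, here is some plain C++ that only prepares the per-device build options, so it runs without an OpenCL installation. The struct device_profile, the function names and the macro name PADDING are invented for this example; only the "-D name=value" option syntax and the clBuildProgram signature quoted in the comment come from the OpenCL specification.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical description of one device and its tuned padding size.
struct device_profile
{
  std::string name;
  std::size_t padding;
};

// Build options for a single device. "-D name=value" is a standard OpenCL
// preprocessor option; the macro name PADDING is an assumption.
inline std::string build_options_for(device_profile const & dev)
{
  assert(dev.padding > 0);
  return "-D PADDING=" + std::to_string(dev.padding);
}

// One (device name, options) pair per device. With OpenCL, each entry
// would correspond to one call of the form
//   clBuildProgram(program, 1, &device_id, options.c_str(), NULL, NULL);
// i.e. a device_list of length one, e.g. on a separate program object
// per device.
inline std::vector<std::pair<std::string, std::string> >
per_device_build_plan(std::vector<device_profile> const & devices)
{
  std::vector<std::pair<std::string, std::string> > plan;
  for (std::size_t i = 0; i < devices.size(); ++i)
    plan.push_back(std::make_pair(devices[i].name, build_options_for(devices[i])));
  return plan;
}
```

Each device in the context then gets kernels generated for its own padding, and no fallback kernel is needed when the paddings differ.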

Best regards,
Karli


_______________________________________________
ViennaCL-devel mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/viennacl-devel