Oops,

Something else I forgot to mention: GEMM is not the only routine on Earth ;)
It may very well be that an optimized FFT requires a different padding size.
It gets tricky; we might have to run the auto-tuner for a lot of different
padding sizes...
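
To give an idea of what that sweep would look like, here is a minimal C++
sketch. Everything in it is hypothetical (the candidate list and the
'benchmark' callback are placeholders, this is not the actual tuner
interface); it only shows the shape of the loop we would be running per
routine and per device:

    // Hypothetical sketch: sweep candidate padding sizes for one routine on
    // one device and keep the fastest. 'benchmark' is a placeholder callback
    // that runs the routine (GEMM, FFT, ...) with the given padding size and
    // returns the elapsed time in seconds.
    #include <cstddef>
    #include <functional>
    #include <limits>
    #include <vector>

    std::size_t tune_padding(const std::vector<std::size_t>& candidates,
                             const std::function<double(std::size_t)>& benchmark)
    {
      std::size_t best_pad  = candidates.front();
      double      best_time = std::numeric_limits<double>::max();
      for (std::size_t pad : candidates)
      {
        double t = benchmark(pad);   // e.g. the median of a few timed runs
        if (t < best_time)
        {
          best_time = t;
          best_pad  = pad;
        }
      }
      return best_pad;   // e.g. 96 for SGEMM on Hawaii, 64 on Fermi
    }

Running this once per routine and per device is exactly why the tuning time
blows up as the candidate list gets long.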

Philippe


2014-02-28 5:30 GMT+01:00 Philippe Tillet <phil.til...@gmail.com>:

> Hello everybody !
>
> So, it seems like padding sizes really do matter, and that, imho, making
> ViennaCL truly performance-portable will at some point require
> hardware-adaptive padding sizes.
>
> For now, the generator only handles NonTrans x Trans multiplication, but I
> can already tell that:
>
> ----------------------------------------------------
> -> Fermi :
> SGEMM - Optimal padding size is 64. About 10-15% performance loss if using
> a padding size of 48 or 96
> DGEMM - Have not tested it yet...
>
> -> Hawaii
> SGEMM - Optimal padding size is 96. About 10-15% performance loss if using
> a padding size of 64 or 128
> DGEMM - Optimal padding size is 48. About 10-15% loss if using a padding
> size of 64 or 128.
> More importantly, AMD GPUs usually show horrible performance at some sizes
> (4096, 4608, etc.), where bank conflicts happen and performance drops by 5x
> to 10x. Using a padding size of 48/96 not only gives better peak
> performance, but also circumvents this weird issue (see the small sketch
> just below).
> ---------------------------------------------------
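>
> As a small illustration of the bank-conflict point above, this is how a
> padded leading dimension would be computed (just a sketch; the function
> name is made up and is not ViennaCL code):
>
>     #include <cstddef>
>
>     // Round a matrix dimension up to the next multiple of the padding size.
>     std::size_t padded_size(std::size_t n, std::size_t padding)
>     {
>       return ((n + padding - 1) / padding) * padding;
>     }
>
>     // padded_size(4096, 96) == 4128 -> the stride is no longer a power of
>     //                                  two, so the 5x-10x slowdown is gone
>     // padded_size(4096, 64) == 4096 -> still a power of two, i.e. exactly
>     //                                  the slow case described above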
>
> So, it seems like there is a 10-15% penalty (and sometimes much more on
> AMD hardware) from not choosing the correct padding size. On Hawaii, this
> means that one will obtain (wowow, exclusive report on ViennaCL 1.6's
> performance on Hawaii!) 3.8 TFLOP/s instead of 4.2 TFLOP/s, and I think
> this difference is significant enough to be worth dealing with.
>
> I'm not sure, however, how we should deal with this issue. Since kernels
> are compiled at the context level and since we plan to use one device per
> context, what would you think about handling the matrix padding size within
> the context instead of within the matrix?
>
> I think we shouldn't expose it to the user, though, since the kernels have
> to be entirely compatible with the padding size (and we don't want the user
> to break everything!). What would you think about querying the generator,
> at context initialization, for the optimal padding size of the current
> device? If the context has multiple devices with incompatible padding
> sizes, how should we handle it? Display a warning about low performance and
> use a crappy fallback kernel?
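>
> To make the proposal a bit more concrete, here is a rough sketch of what
> the context-level handling could look like (all the names below are
> invented for illustration, none of this exists yet):
>
>     #include <cstddef>
>     #include <iostream>
>     #include <stdexcept>
>     #include <vector>
>
>     // Hypothetical per-device info, e.g. as reported by the generator.
>     struct device_info { std::size_t preferred_padding; };
>
>     struct context
>     {
>       std::vector<device_info> devices;
>       std::size_t padding = 1;   // 1 == unpadded, generic fallback kernels
>
>       // Called once at context initialization.
>       void init_padding()
>       {
>         if (devices.empty())
>           throw std::runtime_error("no device associated with this context");
>         padding = devices.front().preferred_padding;
>         for (const auto& d : devices)
>           if (d.preferred_padding != padding)
>           {
>             std::cerr << "warning: devices disagree on the padding size, "
>                          "falling back to unpadded (slow) kernels\n";
>             padding = 1;
>             break;
>           }
>       }
>     };
>
> The user would never see the padding; matrices created through the context
> would just pick it up, so nobody can break the compatibility between the
> kernels and the padding size by hand.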
>
> Best regards,
> Philippe
>
