Hi Aanchan,

> I went through the paper you sent me that talks about sub-vectors and
> sub-matrices for the Householder reflections by the use of what you
> call proxy objects.
>
> I had some questions that stem from reading that paper.
>
> 1. About the back-end: The paper talks about OpenCL as the back-end.
> As I understand it, OpenCL uses a just-in-time compiler, so an
> equivalent OpenCL kernel is generated on the fly for each high-level
> expression involving a ViennaCL type. What about the case of CUDA,
> where the nvcc compiler works more like the g++ compiler - how do
> expressions containing ViennaCL types get translated there?
By the time the paper was written, ViennaCL provided only an OpenCL backend. You are right that OpenCL uses a just-in-time compiler. Since the overhead of launching the jit-compiler is high, we use predefined sets of operations (similar to BLAS), internally grouped into different modules which get compiled on first access. For example, all operations on a compressed_matrix constitute one such module; similarly, all vector operations constitute another module, etc. Since these operations are known a priori, we can provide the same routines implemented in CUDA, compiled through nvcc, and dispatch as needed. For 'complicated' operations such as the vector operation x = y + z - x + y - z; temporary objects are introduced, so that the 'complicated' operation is decomposed into smaller operations supported by the backend (see the first sketch in the P.S. below). This code path is again the same for all three backends (OpenCL, CUDA, and the OpenMP host backend).

> 2. I am assuming the use of temporary variables and swapping between
> the CPU and GPU should be minimized during development for memory and
> speed reasons. Also, one should probably be careful to understand how
> often the JIT OpenCL kernel generation is launched.

Yes, reducing temporaries and host-device communication is essential. Our current jit handling policy is an (imho fairly good) compromise between execution performance and minimizing jit-overhead.

> 3. I also read the Halko-Tropp SVD paper
> (http://arxiv.org/pdf/0909.4061.pdf) on rank-k SVD approximations
> using a random projection. Their chief argument is that their
> algorithm for a rank-k SVD approximation of an m-by-n matrix is
> O(mn*log(k)) compared to the usual O(mnk). I am assuming that is the
> reason you pointed me to that work. I still have to run profiling on
> the current SVD implementation, which I guess could be slow for a
> number of reasons, including anything in the code that might cause a
> PCI-e transfer.

Our current SVD implementation is not optimal with respect to balancing the sequential computations run on the host against the massively parallel computations run on the device (i.e. matrix-matrix products). Similar to the case of the QR factorization, one can adaptively choose the panel size such that the computations on the host stay in sync with the offloaded computations on the device - such optimization options haven't been fully exploited yet. The second sketch in the P.S. below shows how the Halko-Tropp structure itself maps onto ViennaCL types.

Hope that helps :-)

Best regards,
Karli
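P.S.: Two sketches, in case they are useful. First, the decomposition from 1. made explicit. This is an untested sketch (the function name is just for illustration); each statement of the decomposed variant maps onto one predefined BLAS-like kernel, mirroring what the library does internally with a temporary:

  #include <cstddef>
  #include "viennacl/vector.hpp"

  void decompose_example(std::size_t N)
  {
    viennacl::vector<double> x(N), y(N), z(N);

    // one-liner: the expression-template layer introduces a temporary,
    // since no single predefined kernel covers all five operands:
    x = y + z - x + y - z;

    // equivalent explicit decomposition into supported kernels:
    viennacl::vector<double> tmp(N);
    tmp = y + z;   // tmp <- y + z
    tmp -= x;      // tmp <- tmp - x
    tmp += y;      // tmp <- tmp + y
    tmp -= z;      // tmp <- tmp - z
    x = tmp;       // write back
  }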
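Second, the overall structure of the Halko-Tropp range finder from 3., keeping the two large matrix-matrix products on the device via viennacl::linalg::prod(). Again just a sketch: the small QR/SVD factorizations are left as comments because they would run on the host, and no actual ViennaCL calls are implied for them:

  #include <cstddef>
  #include "viennacl/matrix.hpp"
  #include "viennacl/linalg/prod.hpp"

  void randomized_range_finder(viennacl::matrix<double> const & A,
                               std::size_t k)
  {
    // 1. Gaussian test matrix Omega (n x k): fill on the host,
    //    transfer once with viennacl::copy().
    viennacl::matrix<double> Omega(A.size2(), k);

    // 2. Sample the range of A:  Y = A * Omega   (m x k, on the device)
    viennacl::matrix<double> Y(A.size1(), k);
    Y = viennacl::linalg::prod(A, Omega);

    // 3. Orthonormalize Y -> Q on the host (only k columns):  Q = qr(Y)
    // 4. Project:  B = Q^T * A   (k x n, on the device again via
    //    prod(trans(Q), A))
    // 5. Factor the small B on the host:  B = U_small * S * V^T,
    //    then U = Q * U_small gives the rank-k approximation of A.
  }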