Hi Aanchan,

> I went through the paper you sent me that talks about sub-vectors and
> sub-matrices for the Householder reflections by the use of what you
> call proxy objects.
>
> I had some questions that stem from reading that paper.
>
> 1. About the back-end: The paper talks about OpenCL as the back-end.
> As I understand it, OpenCL uses a just-in-time compiler, so an
> equivalent OpenCL kernel is generated on the fly for each high-level
> expression involving a ViennaCL type. What about the case of CUDA,
> where the nvcc compiler works more like the g++ compiler - how do
> expressions containing ViennaCL types get translated there?
By the time the paper was written, ViennaCL provided only an OpenCL backend. You are right that OpenCL uses a just-in-time compiler. Since the overhead of launching the jit-compiler is high, we use predefined sets of operations (similar to BLAS), internally grouped into different modules which get compiled on first access. For example, all operations on a compressed_matrix constitute one such module; similarly, all vector operations constitute another module, etc. Since these operations are known a priori, we can provide the same routines implemented in CUDA, compiled through nvcc, and dispatch as needed. For 'complicated' operations such as the vector operation x = y + z - x + y - z; temporary objects are introduced, so that the 'complicated' operation is decomposed into smaller operations supported by the backend (see the first sketch in the P.S. below). This code path is again the same for all three backends (OpenCL, CUDA, and the OpenMP host backend).

> 2. I am assuming the use of temporary variables and swapping between
> the CPU and GPU should be minimized during development for memory and
> speed reasons. Also, one should probably be careful to understand how
> often the JIT OpenCL kernel generation is launched.

Yes, reducing temporaries and host-device communication is essential. Our current jit handling policy is an (imho fairly good) compromise between execution performance and minimizing jit-overhead.

> 3. I also read the Halko-Tropp SVD paper
> (http://arxiv.org/pdf/0909.4061.pdf) on rank-k SVD approximations
> using a random projection. Their chief argument is that their
> algorithm for a rank-k SVD approximation of an m-by-n matrix is
> O(mn*log(k)) compared to the usual O(mnk). I am assuming that is the
> reason you pointed me to that work. I still have to run profiling on
> the current SVD implementation, which I guess could be slow for a
> number of reasons, including anything in the code that might cause a
> PCI-e transfer.

Our current SVD implementation is not optimal with respect to balancing the sequential computations run on the host against the massively parallel computations run on the device (i.e. matrix-matrix products). Similar to the case of the QR factorization, one can adaptively choose the panel size such that the computations on the host stay in sync with the offloaded computations on the device - such optimization options haven't been fully exploited yet. The second sketch in the P.S. below shows how the Halko-Tropp structure itself maps onto ViennaCL types.

Hope that helps :-)

Best regards,
Karli
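P.S.: Two sketches, in case they are useful. First, the decomposition from 1. made explicit. This is an untested sketch (the function name is just for illustration); each statement of the decomposed variant maps onto one predefined BLAS-like kernel, mirroring what the library does internally with a temporary:

  #include <cstddef>
  #include "viennacl/vector.hpp"

  void decompose_example(std::size_t N)
  {
    viennacl::vector<double> x(N), y(N), z(N);

    // one-liner: the expression-template layer introduces a temporary,
    // since no single predefined kernel covers all five operands:
    x = y + z - x + y - z;

    // equivalent explicit decomposition into supported kernels:
    viennacl::vector<double> tmp(N);
    tmp = y + z;   // tmp <- y + z
    tmp -= x;      // tmp <- tmp - x
    tmp += y;      // tmp <- tmp + y
    tmp -= z;      // tmp <- tmp - z
    x = tmp;       // write back
  }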
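Second, the overall structure of the Halko-Tropp range finder from 3., keeping the two large matrix-matrix products on the device via viennacl::linalg::prod(). Again just a sketch: the small QR/SVD factorizations are left as comments because they would run on the host, and no actual ViennaCL calls are implied for them:

  #include <cstddef>
  #include "viennacl/matrix.hpp"
  #include "viennacl/linalg/prod.hpp"

  void randomized_range_finder(viennacl::matrix<double> const & A,
                               std::size_t k)
  {
    // 1. Gaussian test matrix Omega (n x k): fill on the host,
    //    transfer once with viennacl::copy().
    viennacl::matrix<double> Omega(A.size2(), k);

    // 2. Sample the range of A:  Y = A * Omega   (m x k, on the device)
    viennacl::matrix<double> Y(A.size1(), k);
    Y = viennacl::linalg::prod(A, Omega);

    // 3. Orthonormalize Y -> Q on the host (only k columns):  Q = qr(Y)
    // 4. Project:  B = Q^T * A   (k x n, on the device again via
    //    prod(trans(Q), A))
    // 5. Factor the small B on the host:  B = U_small * S * V^T,
    //    then U = Q * U_small gives the rank-k approximation of A.
  }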