Hey,

 > For my research I will have to deal with extremely non-square matrices,
> potentially 32*10 000, 4*80 000, or even 128*1 000 000.
> This often occurs in statistics, where one has a small number of
> variables in one dimension and a very large number of samples in the
> other. The resulting extremely non-square matrices do not behave the
> same way as square ones, both implementation-wise and algorithm-wise.
>
> Consider I want to compute the covariance (let's forget the symmetry of
> the resulting matrix for now) of 32 variables over 100 000 samples. An
> implementation for square matrices would probably launch it on one
> single result block/2-4 work groups (the result matrix is only 32*32).
> Indeed, what we want here is not one work group per block of the result
> matrix. What we would want instead is something more like multiple
> matrix-vector products, one for each column of the result matrix. That
> way, if each matrix-vector product launches on 32 work groups (one work
> group per row), we get much better occupancy.
> For an even more extreme case (4*100 000, think of 4 images of 100 000
> pixels each, for example), we might just want to go for the computation
> of 16 inner products.
>
> The problem doesn't really arise when the resulting matrix is big enough
> (256*100 000) since this results in enough result blocks.
>
> This would be, to me, more than a nice-to-have feature, since without
> such dispatching, GPUs don't make any sense for these operations. Now,
> the question is: how to handle it? Do you guys think we should do it
> transparently in the backend, or let the user choose the dispatching
> they want?
> C = viennacl::linalg::prod(A, B, compute_bound());           // default, GEMM-based
> C = viennacl::linalg::prod(A, B, bandwidth_compute_bound()); // GEMV-based
> C = viennacl::linalg::prod(A, B, bandwidth_bound());         // DOT-based

Thanks, Phil, this is something I hope to provide more cleverness for. 
Since the estimate of whether an operation is compute- or bandwidth-bound 
can be made from the matrix dimensions alone (assuming dense matrices, of 
course), I prefer to make any dispatch entirely automatic in the 
background, just like we do now for the device-specific kernels. Can you 
think of any scenario where a user really wants to have control over this?
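
To make the idea concrete, such an automatic dispatch could boil down to 
a small shape-based heuristic. The sketch below is hypothetical (not 
ViennaCL code, and the thresholds are illustrative only): for C = A * B 
with A of size m-by-k and B of size k-by-n, it picks a strategy from the 
result-matrix shape, matching the three cases from Phil's examples:

```cpp
#include <cstddef>
#include <string>

// Hypothetical dispatch heuristic for C = A * B, where A is m-by-k and
// B is k-by-n. Threshold values are placeholders, not tuned constants.
std::string select_kernel(std::size_t m, std::size_t n, std::size_t k)
{
    (void)k; // the inner dimension mainly affects tuning, not the strategy

    if (m * n <= 16)      // tiny result matrix (e.g. 4*100 000 covariance):
        return "dot";     // just compute a handful of inner products

    if (m <= 64 || n <= 64)  // narrow result (e.g. 32*32 from 32*100 000):
        return "gemv";       // multiple matrix-vector products, one work
                             // group per row for better occupancy

    return "gemm";        // enough result blocks for the blocked GEMM path
}
```

With this, select_kernel(4, 4, 100000) picks the DOT-based path, 
select_kernel(32, 32, 100000) the GEMV-based one, and a large result 
such as select_kernel(1024, 1024, 1024) falls through to blocked GEMM.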

Best regards,
Karli


_______________________________________________
ViennaCL-devel mailing list
ViennaCL-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/viennacl-devel
