Hey hey,

I can't think of any case where one would want to have control over
this; making an appropriate choice would require knowledge of our
implementations anyway. To get a reasonable decision process, we need to
come up with some heuristics...
My first idea would be to compute a degree of "compute-boundedness" and
to dispatch according to it.

T_{compute} = 2 \cdot \frac{MKN}{Peak_{compute}}

T_{bandwidth} = S \cdot \frac{MK + KN}{Peak_{bandwidth}}

k = \frac{T_{compute}}{T_{bandwidth}} = \alpha \cdot \frac{MN}{M+N}

\alpha = 2 \cdot \frac{Peak_{bandwidth}}{S \cdot Peak_{compute}}
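
(Spelling out the middle step, in case the cancellation of K is not obvious:

k = \frac{2 MKN / Peak_{compute}}{S (MK + KN) / Peak_{bandwidth}}
  = \frac{2 Peak_{bandwidth}}{S \cdot Peak_{compute}} \cdot \frac{MKN}{K (M + N)}
  = \alpha \cdot \frac{MN}{M+N}

so K drops out entirely; this is also the weakness I come back to below.)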

Now, for the dispatching, we can proceed in the following way:

k < k_1 \Rightarrow \textrm{Multiple Inner Products}
k_1 \leq k \leq k_2 \Rightarrow \textrm{Multiple Matrix-Vector Products}
k_2 < k \Rightarrow \textrm{Matrix-Matrix Product}
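
In code, the dispatch could look roughly like the sketch below. This is
only a first shot in plain C++; gemm_kind, dispatch_gemm and the
threshold values are made-up placeholders, not existing ViennaCL API:

// Hypothetical sketch of the heuristic above; all names and thresholds
// are placeholders, not part of ViennaCL.
#include <cstddef>

enum gemm_kind { MULTIPLE_INNER_PRODUCTS, MULTIPLE_MATRIX_VECTOR, MATRIX_MATRIX };

gemm_kind dispatch_gemm(std::size_t M, std::size_t N,
                        double peak_bandwidth,  // bytes/s, device-dependent
                        double peak_compute,    // FLOP/s, device-dependent
                        double S,               // scalar size in bytes
                        double k1, double k2)   // empirical thresholds
{
  double alpha = 2.0 * peak_bandwidth / (S * peak_compute);
  double k = alpha * double(M) * double(N) / double(M + N);
  if (k < k1)
    return MULTIPLE_INNER_PRODUCTS;  // extremely flat result matrix
  if (k <= k2)
    return MULTIPLE_MATRIX_VECTOR;   // in-between regime
  return MATRIX_MATRIX;              // classic blocked GEMM
}

For the 32-variable covariance example quoted below (M = N = 32,
100 GB/s, 1 TFLOP/s, floats), dispatch_gemm(32, 32, 1e11, 1e12, 4.0,
1.0, 10.0) gives k = 0.8 < k_1 and thus picks the inner-product path.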

Since the bandwidth and the compute power are device-dependent, theory
suggests that this "compute-boundedness" degree should also be
device-dependent. However, I think we can reason in terms of orders of
magnitude for now, assuming 100 GB/s of bandwidth for 1 TFLOP/s of
compute... Assuming these numbers, square matrices (M = N), and S = 4
bytes (float), we get:

k = 0.025 \cdot N
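
(Plugging in: \alpha = 2 \cdot \frac{10^{11}}{4 \cdot 10^{12}} = 0.05,
and with M = N we have \frac{MN}{M+N} = \frac{N}{2}, hence
k = 0.05 \cdot \frac{N}{2} = 0.025 \cdot N.)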

Now the choice of k_1 and k_2 seems purely empirical...

The problem with this model is that it doesn't take the second dimension
K into account... nor does it account for the kernel launch overhead,
which in practice rules out computing too many inner products: taking
k_1 = 1 leads to inner products being used up to N = 40, which involves
the computation of 40 x 40 = 1600 inner products... Even if we can pack
inner products together, that is still far too many to be practical.
It's just a first shot, of course, and any idea/hint is more than welcome :P

Best regards,
Philippe

PS: I hope the LaTeX images come through fine...




2013/9/5 Karl Rupp <r...@iue.tuwien.ac.at>

> Hey,
>
>  > For my research I will have to deal with extremely nonsquare matrices.
> > That is potentially 32*10 000, 4*80 000, or even 128*1 000 000.
> > This often occurs in statistics, where one has a small number of
> > variables in the first dimension, and a significant number of samples
> > in the other dimension. This results in extremely non-square matrices,
> > which do not exhibit the same behavior, both implementation-wise and
> > algorithm-wise.
> >
> > Suppose I want to compute the covariance (let's forget the symmetry of
> > the resulting matrix for now) of 32 variables over 100 000 samples. An
> > implementation tuned for square matrices would probably launch it on a
> > single result block / 2-4 work groups (the result matrix is only 32*32).
> > Indeed, what we want here is not one work-group per block of the result
> > matrix. What we would want is something more like a multiple
> > matrix-vector product, one for each column of the result matrix. That
> > way, if each matrix-vector product launches on 32 work groups (one work
> > group per row), we get much better occupancy.
> > For an even more extreme case (4*100 000, think of 4 images of 100 000
> > pixels each, for example), we might just want to go for the computation
> > of 16 inner products.
> >
> > The problem doesn't really arise when the resulting matrix is big enough
> > (256*100 000) since this results in enough result blocks.
> >
> > This would be, to me, more than a nice-to-have feature, since without
> > dispatching, GPUs don't make any sense for these operations. Now, the
> > question is: how to handle it? Do you guys think we should do it
> > transparently in the backend, or let the user choose the dispatching he
> > wants?
> > C = viennacl::linalg::prod(A, B, computebound()); //default
> > C = viennacl::linalg::prod(A, B, bandwidthcomputebound()); //GEMV based
> > C = viennacl::linalg::prod(A, B, bandwidthbound()); //DOT based
>
> Thanks, Phil, this is something I hope to provide more cleverness for.
> Since the estimations about whether it is compute or bandwidth bound can
> be made based on the matrix dimensions only (of course, assuming dense
> matrices here), I prefer to make any dispatch entirely automatic in the
> background, just like we do now for the device-specific kernels. Can you
> think of any scenario where a user really wants to have control over this?
>
> Best regards,
> Karli
>