Hi Alexander,

Using GPUs with Spark would be very exciting.  Small comment: Concerning
your question earlier about keeping data stored on the GPU rather than
having to move it between main memory and GPU memory on each iteration, I
would guess this would be critical to getting good performance.  If you
could do multiple local iterations before aggregating results, then the
cost of data movement to the GPU could be amortized (and I believe that is
done in practice).  Having Spark be aware of the GPU and use it as
another tier of memory sounds like a much bigger undertaking.
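
The amortization point can be put in rough numbers. A minimal sketch, with all figures (bandwidth, compute time, data size) as assumed placeholders rather than measurements:

```java
// Rough cost model for GPU offload: the host<->device transfer is paid
// once per batch of local iterations, while compute is paid per
// iteration. All numbers are hypothetical, not measurements.
public class AmortizedOffload {
    // Time to move `bytes` across the PCIe bus, in milliseconds.
    static double transferMs(long bytes, double gbPerSec) {
        return bytes / (gbPerSec * 1e6);
    }

    // Average per-iteration cost when `iters` local iterations share
    // one round trip of data movement.
    static double amortizedMs(long bytes, double gbPerSec,
                              double computeMsPerIter, int iters) {
        return 2 * transferMs(bytes, gbPerSec) / iters + computeMsPerIter;
    }

    public static void main(String[] args) {
        // 100 MB partition, ~10 GB/s effective PCIe bandwidth, 5 ms of
        // GPU compute per iteration (all assumed figures).
        System.out.println(amortizedMs(100_000_000L, 10.0, 5.0, 1));   // prints 25.0
        System.out.println(amortizedMs(100_000_000L, 10.0, 5.0, 10));  // prints 7.0
    }
}
```

With these assumed numbers, one iteration per transfer costs 25 ms, while ten local iterations bring the per-iteration cost down to 7 ms - the amortization effect described above.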

Joseph

On Thu, Feb 5, 2015 at 4:59 PM, Ulanov, Alexander <alexander.ula...@hp.com>
wrote:

> Thank you for the explanation! I’ve watched the BIDMach presentation by John
> Canny and I am really inspired by his talk and comparisons with Spark MLlib.
>
> I am very interested to find out what will be better within Spark: BIDMat
> or netlib-java with CPU or GPU natives. Could you suggest a fair way to
> benchmark them? Currently I do benchmarks on artificial neural networks in
> batch mode. While it is not a “pure” test of linear algebra, it involves
> some other things that are essential to machine learning.
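
One way to make such a comparison fair is to isolate the kernel being measured and control for JIT warm-up and GC jitter. A minimal harness sketch - the naive `gemm` below is only a stand-in workload; a real benchmark would call the BIDMat or netlib-java multiply at that point, on identically laid-out inputs:

```java
import java.util.Arrays;
import java.util.Random;

// Minimal benchmark harness sketch: warm up the JIT, time several
// trials, and report the median so a single GC pause cannot skew the
// result.
public class GemmBench {

    // Naive C += A * B for n x n row-major matrices. A real comparison
    // would invoke the BIDMat or netlib-java multiply here instead.
    static void gemm(int n, double[] a, double[] b, double[] c) {
        for (int i = 0; i < n; i++) {
            for (int k = 0; k < n; k++) {
                double aik = a[i * n + k];
                for (int j = 0; j < n; j++) {
                    c[i * n + j] += aik * b[k * n + j];
                }
            }
        }
    }

    // Median wall-clock milliseconds over `trials` timed runs, after
    // `warmup` untimed runs on the same data.
    static double medianMs(int n, int warmup, int trials) {
        Random rnd = new Random(42);  // fixed seed for repeatability
        double[] a = rnd.doubles((long) n * n).toArray();
        double[] b = rnd.doubles((long) n * n).toArray();
        double[] c = new double[n * n];
        double[] times = new double[trials];
        for (int t = 0; t < warmup + trials; t++) {
            Arrays.fill(c, 0.0);
            long start = System.nanoTime();
            gemm(n, a, b, c);
            if (t >= warmup) {
                times[t - warmup] = (System.nanoTime() - start) / 1e6;
            }
        }
        Arrays.sort(times);
        return times[trials / 2];
    }

    public static void main(String[] args) {
        System.out.printf("256x256 naive GEMM median: %.2f ms%n",
                          medianMs(256, 3, 5));
    }
}
```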
>
> From: Evan R. Sparks [mailto:evan.spa...@gmail.com]
> Sent: Thursday, February 05, 2015 1:29 PM
> To: Ulanov, Alexander
> Cc: dev@spark.apache.org
> Subject: Re: Using CUDA within Spark / boosting linear algebra
>
> I'd be surprised if BIDMat+OpenBLAS was significantly faster than
> netlib-java+OpenBLAS, but if it is much faster it's probably due to data
> layout and fewer levels of indirection - it's definitely a worthwhile
> experiment to run. The main speedups I've seen from using it come from
> highly optimized GPU code for linear algebra. I know that in the past Canny
> has gone as far as to write custom GPU kernels for performance-critical
> regions of code.[1]
>
> BIDMach is highly optimized for single node performance or performance on
> small clusters.[2] Once data doesn't fit easily in GPU memory (or cannot
> be batched so that it does), the performance tends to fall off. Canny
> argues for hardware/software codesign and as such prefers machine
> configurations that are quite different from what we find in most
> commodity cluster nodes - e.g. 10 disk channels and 4 GPUs.
>
> In contrast, MLlib was designed for horizontal scalability on commodity
> clusters and works best on very big datasets - on the order of terabytes.
>
> For the most part, these projects developed concurrently to address
> slightly different use cases. That said, there may be bits of BIDMach we
> could repurpose for MLlib - keep in mind we need to be careful about
> maintaining cross-language compatibility for our Java and Python users,
> though.
>
> - Evan
>
> [1] - http://arxiv.org/abs/1409.5402
> [2] - http://eecs.berkeley.edu/~hzhao/papers/BD.pdf
>
> On Thu, Feb 5, 2015 at 1:00 PM, Ulanov, Alexander <alexander.ula...@hp.com>
> wrote:
> Hi Evan,
>
> Thank you for the suggestion! BIDMat seems to have terrific speed. Do you
> know what makes it faster than netlib-java?
>
> The same group has the BIDMach library that implements machine learning.
> For some examples they use the Caffe convolutional neural network library,
> developed by another group at Berkeley. Could you elaborate on how all of
> these might be connected with Spark MLlib? If you take BIDMat for linear
> algebra, why don’t you take BIDMach for optimization and learning?
>
> Best regards, Alexander
>
> From: Evan R. Sparks [mailto:evan.spa...@gmail.com]
> Sent: Thursday, February 05, 2015 12:09 PM
> To: Ulanov, Alexander
> Cc: dev@spark.apache.org
> Subject: Re: Using CUDA within Spark / boosting linear algebra
>
> I'd expect that we can make GPU-accelerated BLAS faster than CPU BLAS in
> many cases.
>
> You might consider taking a look at the codepaths that BIDMat (
> https://github.com/BIDData/BIDMat) takes and comparing them to
> netlib-java/breeze. John Canny et al. have done a bunch of work optimizing
> to make this work really fast from Scala. I've run it on my laptop and
> compared to MKL and in certain cases it's 10x faster at matrix multiply.
> There are a lot of layers of indirection here and you really want to avoid
> data copying as much as possible.
>
> We could also consider swapping out Breeze for BIDMat, but that would be a
> big project, and if we can figure out how to get Breeze+cuBLAS to
> comparable performance, that would be a big win.
>
> On Thu, Feb 5, 2015 at 11:55 AM, Ulanov, Alexander
> <alexander.ula...@hp.com> wrote:
> Dear Spark developers,
>
> I am exploring how to make linear algebra operations faster within Spark.
> One way of doing this is to use the Scala Breeze library that is bundled
> with Spark. For matrix operations, it employs netlib-java, which provides
> a Java wrapper for BLAS (basic linear algebra subprograms) and LAPACK
> native binaries if they are available on the worker node. It also has its
> own optimized Java implementation of BLAS. It is worth mentioning that
> native binaries provide better performance only for BLAS level 3, i.e.
> matrix-matrix operations such as general matrix multiplication (GEMM).
> This is confirmed by the GEMM test on the netlib-java page
> https://github.com/fommil/netlib-java. I also confirmed it with my
> experiments with training of an artificial neural network
> https://github.com/apache/spark/pull/1290#issuecomment-70313952. However,
> I would like to boost performance further.
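
The observation that natives pay off only at BLAS level 3 lines up with arithmetic intensity: GEMM performs O(n^3) flops on O(n^2) data, so each element is reused about n times, while level 1 and 2 routines are memory-bound no matter how fast the arithmetic is. A small illustration (flops-per-byte ratios only; the constants are the usual textbook operation counts, not anything measured):

```java
// Arithmetic intensity (flops per byte of data touched) for the three
// BLAS levels, using textbook operation counts and 8-byte doubles.
// GEMM reuses each element ~n times; axpy and gemv stay memory-bound
// however fast the arithmetic is.
public class BlasIntensity {
    // Level 1, axpy on n-vectors: 2n flops over 3n doubles.
    static double level1(int n) { return 2.0 * n / (3.0 * n * 8); }

    // Level 2, n x n gemv: 2n^2 flops over roughly n^2 doubles.
    static double level2(int n) { return 2.0 * n * n / (8.0 * n * n); }

    // Level 3, n x n gemm: 2n^3 flops over roughly 3n^2 doubles.
    static double level3(int n) { return 2.0 * n * n * n / (24.0 * n * n); }

    public static void main(String[] args) {
        int n = 1000;
        System.out.println(level1(n));  // ~0.083 flops/byte, independent of n
        System.out.println(level2(n));  // 0.25 flops/byte, independent of n
        System.out.println(level3(n));  // ~83 flops/byte, grows with n
    }
}
```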
>
> GPUs are supposed to be fast at linear algebra, and there is an Nvidia
> CUDA implementation of BLAS called cuBLAS. I have one Linux server with an
> Nvidia GPU and I was able to do the following. I linked cuBLAS (instead of
> the CPU-based BLAS) with the netlib-java wrapper and put it into Spark, so
> Breeze/netlib is using it. Then I did some performance measurements with
> regard to artificial neural network batch learning in Spark MLlib, which
> involves matrix-matrix multiplications. It turns out that for matrices of
> size less than ~1000x780, GPU cuBLAS has the same speed as CPU BLAS, and
> cuBLAS becomes slower for bigger matrices. It is worth mentioning that it
> was not a test of ONLY multiplication, since other operations are involved.
> One of the reasons for the slowdown might be the overhead of copying the
> matrices from main memory to graphics card memory and back.
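
The copy-overhead hypothesis can be checked against a back-of-the-envelope model: a GPU pays an O(n^2) transfer cost per operation but computes the O(n^3) GEMM faster, so it should win only past some crossover size. All rates below are assumed round numbers (e.g. ~2 GB/s effective host-device bandwidth once JNI copies are included), not measurements of any particular machine:

```java
// Back-of-the-envelope crossover model: GPU GEMM must recoup an O(n^2)
// PCIe transfer cost through faster O(n^3) compute. All rates are
// assumed round numbers, not measurements.
public class GpuCrossover {
    // GPU time: ship A, B down and C back (3 * n^2 doubles), then run
    // the 2 * n^3 flops of GEMM at `gpuGflops`.
    static double gpuMs(int n, double pcieGBs, double gpuGflops) {
        double transferMs = 3.0 * n * n * 8 / (pcieGBs * 1e6);
        double computeMs = 2.0 * n * n * n / (gpuGflops * 1e6);
        return transferMs + computeMs;
    }

    // CPU time: just the 2 * n^3 flops at `cpuGflops`, data already local.
    static double cpuMs(int n, double cpuGflops) {
        return 2.0 * n * n * n / (cpuGflops * 1e6);
    }

    public static void main(String[] args) {
        // Assumed: 2 GB/s effective transfer (including JNI copies),
        // 300 GFLOP/s GPU GEMM vs 100 GFLOP/s multicore CPU GEMM.
        for (int n : new int[]{500, 1000, 2000}) {
            System.out.printf("n=%d gpu=%.1f ms cpu=%.1f ms%n",
                              n, gpuMs(n, 2.0, 300.0), cpuMs(n, 100.0));
        }
    }
}
```

Under these assumed rates, one round trip per multiply lets the GPU break even near n ≈ 1000; that cuBLAS actually got slower for larger matrices suggests overheads beyond a single round-trip transfer (e.g. per-operation copies through the JNI boundary or layout conversions). Keeping intermediates resident in GPU memory would remove the 3n^2 transfer term entirely, which is what question 2 is about.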
>
> So, a few questions:
> 1) Do these results with CUDA make sense?
> 2) If the problem is copy overhead, are there any libraries that allow
> intermediate results to stay in graphics card memory, thus removing the
> overhead?
> 3) Are there any other options to speed up linear algebra in Spark?
>
> Thank you, Alexander
>
