Re: Using CUDA within Spark / boosting linear algebra

Nicholas Chammas Sun, 08 Feb 2015 14:45:00 -0800

Lemme butt in randomly here and say there is an interesting discussion on
this Spark PR <https://github.com/apache/spark/pull/4448> about
netlib-java, JBLAS, Breeze, and other things I know nothing of, that y'all
may find interesting. Among the participants is the author of netlib-java.


On Sun Feb 08 2015 at 2:48:19 AM Ulanov, Alexander <[email protected]>
wrote:

> Hi Evan, Joseph
>
> I did a few matrix multiplication tests and BIDMat seems to be ~10x faster
> than netlib-java+breeze (sorry for weird table formatting):
>
> | A*B size                | BIDMat MKL  | Breeze+Netlib-java native_system_linux_x86-64 | Breeze+Netlib-java f2jblas |
> +-------------------------+-------------+-----------------------------------------------+----------------------------+
> | 100x100*100x100         | 0.00205596  | 0.03810324                                    | 0.002556                   |
> | 1000x1000*1000x1000     | 0.018320947 | 0.51803557                                    | 1.638475459                |
> | 10000x10000*10000x10000 | 23.78046632 | 445.0935211                                   | 1569.233228                |
>
> Configuration: Intel(R) Xeon(R) CPU E31240 3.3 GHz, 6GB RAM, Fedora 19
> Linux, Scala 2.11.
>
> Later I will run tests with CUDA. I need to install a new CUDA version for
> this purpose.
>
> Do you have any ideas why breeze-netlib with native blas is so much slower
> than BIDMat MKL?
>
> Best regards, Alexander
>
> From: Joseph Bradley [mailto:[email protected]]
> Sent: Thursday, February 05, 2015 5:29 PM
> To: Ulanov, Alexander
> Cc: Evan R. Sparks; [email protected]
> Subject: Re: Using CUDA within Spark / boosting linear algebra
>
> Hi Alexander,
>
> Using GPUs with Spark would be very exciting.  Small comment: Concerning
> your question earlier about keeping data stored on the GPU rather than
> having to move it between main memory and GPU memory on each iteration, I
> would guess this would be critical to getting good performance.  If you
> could do multiple local iterations before aggregating results, then the
> cost of data movement to the GPU could be amortized (and I believe that is
> done in practice).  Having Spark be aware of the GPU and using it as
> another part of memory sounds like a much bigger undertaking.
>
> Joseph
>
> On Thu, Feb 5, 2015 at 4:59 PM, Ulanov, Alexander <[email protected]>
> wrote:
> Thank you for the explanation! I’ve watched the BIDMach presentation by John
> Canny and I am really inspired by his talk and comparisons with Spark MLlib.
>
> I am very interested to find out what will be better within Spark: BIDMat
> or netlib-java with CPU or GPU natives. Could you suggest a fair way to
> benchmark them? Currently I do benchmarks on artificial neural networks in
> batch mode. While it is not a “pure” test of linear algebra, it involves
> some other things that are essential to machine learning.
>
> From: Evan R. Sparks [mailto:[email protected]]
> Sent: Thursday, February 05, 2015 1:29 PM
> To: Ulanov, Alexander
> Cc: [email protected]
> Subject: Re: Using CUDA within Spark / boosting linear algebra
>
> I'd be surprised if BIDMat+OpenBLAS was significantly faster than
> netlib-java+OpenBLAS, but if it is much faster it's probably due to data
> layout and fewer levels of indirection - it's definitely a worthwhile
> experiment to run. The main speedups I've seen from using it come from
> highly optimized GPU code for linear algebra. I know that in the past Canny
> has gone as far as to write custom GPU kernels for performance-critical
> regions of code.[1]
>
> BIDMach is highly optimized for single node performance or performance on
> small clusters.[2] Once data doesn't fit easily in GPU memory (or can't be
> batched in that way) the performance tends to fall off. Canny argues for
> hardware/software codesign and as such prefers machine configurations that
> are quite different than what we find in most commodity cluster nodes -
> e.g. 10 disk channels and 4 GPUs.
>
> In contrast, MLlib was designed for horizontal scalability on commodity
> clusters and works best on very big datasets - order of terabytes.
>
> For the most part, these projects developed concurrently to address
> slightly different use cases. That said, there may be bits of BIDMach we
> could repurpose for MLlib - keep in mind we need to be careful about
> maintaining cross-language compatibility for our Java and Python users,
> though.
>
> - Evan
>
> [1] - http://arxiv.org/abs/1409.5402
> [2] - http://eecs.berkeley.edu/~hzhao/papers/BD.pdf
>
> On Thu, Feb 5, 2015 at 1:00 PM, Ulanov, Alexander <[email protected]> wrote:
> Hi Evan,
>
> Thank you for the suggestion! BIDMat seems to have terrific speed. Do you know
> what makes them faster than netlib-java?
>
> The same group has the BIDMach library, which implements machine learning. For
> some examples they use the Caffe convolutional neural network library, owned by
> another group at Berkeley. Could you elaborate on how these all might be
> connected with Spark MLlib? If you take BIDMat for linear algebra, why don’t
> you take BIDMach for optimization and learning?
>
> Best regards, Alexander
>
> From: Evan R. Sparks [mailto:[email protected]]
> Sent: Thursday, February 05, 2015 12:09 PM
> To: Ulanov, Alexander
> Cc: [email protected]
> Subject: Re: Using CUDA within Spark / boosting linear algebra
>
> I'd expect that we can make GPU-accelerated BLAS faster than CPU blas in
> many cases.
>
> You might consider taking a look at the codepaths that BIDMat (
> https://github.com/BIDData/BIDMat) takes and comparing them to
> netlib-java/breeze. John Canny et al. have done a bunch of work optimizing
> to make this work really fast from Scala. I've run it on my laptop and
> compared to MKL and in certain cases it's 10x faster at matrix multiply.
> There are a lot of layers of indirection here and you really want to avoid
> data copying as much as possible.
>
> We could also consider swapping out Breeze for BIDMat, but that would be a
> big project and if we can figure out how to get breeze+cublas to comparable
> performance that would be a big win.
>
> On Thu, Feb 5, 2015 at 11:55 AM, Ulanov, Alexander <[email protected]> wrote:
> Dear Spark developers,
>
> I am exploring how to make linear algebra operations faster within Spark.
> One way of doing this is to use the Scala Breeze library that is bundled with
> Spark. For matrix operations, it employs Netlib-java, which provides a Java
> wrapper for BLAS (Basic Linear Algebra Subprograms) and LAPACK native
> binaries if they are available on the worker node. It also has its own
> optimized Java implementation of BLAS. It is worth mentioning that native
> binaries provide better performance only for BLAS level 3, i.e.
> matrix-matrix operations or general matrix multiplication (GEMM). This is
> confirmed by the GEMM test on the Netlib-java page,
> https://github.com/fommil/netlib-java. I also confirmed it with my
> experiments with training an artificial neural network,
> https://github.com/apache/spark/pull/1290#issuecomment-70313952. However, I
> would like to boost performance further.
>
> GPUs are supposed to be fast at linear algebra, and there is an Nvidia CUDA
> implementation of BLAS called cuBLAS. I have one Linux server with an Nvidia
> GPU and I was able to do the following. I linked cuBLAS (instead of
> CPU-based BLAS) with the Netlib-java wrapper and put it into Spark, so
> Breeze/Netlib is using it. Then I did some performance measurements of
> artificial neural network batch learning in Spark MLlib, which
> involves matrix-matrix multiplications. It turns out that for matrices of
> size less than ~1000x780, GPU cuBLAS has the same speed as CPU BLAS. cuBLAS
> becomes slower for bigger matrices. It is worth mentioning that this was not
> a test of ONLY multiplication, since there are other operations involved.
> One of the reasons for the slowdown might be the overhead of copying the
> matrices from main memory to graphics card memory and back.
>
> So, a few questions:
> 1) Do these results with CUDA make sense?
> 2) If the problem is the copy overhead, are there any libraries that
> allow forcing intermediate results to stay in graphics card memory, thus
> removing the overhead?
> 3) Are there any other options to speed up linear algebra in Spark?
>
> Thank you, Alexander
>
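
The matrix multiplication timings quoted in this thread are easier to compare as throughput. The following is only a rough sketch: it assumes the table values are wall-clock seconds per multiply (the thread does not state the unit) and counts a dense n x n multiply as 2*n^3 floating-point operations. The dictionary keys and function names are made up for illustration and are not part of BIDMat, Breeze, or netlib-java.

```python
# Timings copied from the table in the thread (decimal commas written as points).
# ASSUMPTION: values are wall-clock seconds per multiply; the thread gives no unit.
# The backend keys below are illustrative labels, not library identifiers.
TIMINGS = {
    100: {"bidmat_mkl": 0.00205596, "netlib_native": 0.03810324, "netlib_f2jblas": 0.002556},
    1000: {"bidmat_mkl": 0.018320947, "netlib_native": 0.51803557, "netlib_f2jblas": 1.638475459},
    10000: {"bidmat_mkl": 23.78046632, "netlib_native": 445.0935211, "netlib_f2jblas": 1569.233228},
}


def gemm_flops(n):
    """Operation count for a dense n x n by n x n multiply: n^3 multiplies + n^3 adds."""
    return 2 * n ** 3


def gflops(n, seconds):
    """Throughput in GFLOP/s for one n x n multiply that took `seconds`."""
    return gemm_flops(n) / seconds / 1e9


def slowdown_vs(timings, baseline):
    """For each size, how many times slower each backend is than `baseline`."""
    return {
        n: {name: t / row[baseline] for name, t in row.items()}
        for n, row in timings.items()
    }


if __name__ == "__main__":
    ratios = slowdown_vs(TIMINGS, "bidmat_mkl")
    for n, row in TIMINGS.items():
        for name, seconds in row.items():
            print(f"n={n:>5} {name:<15} {gflops(n, seconds):8.2f} GFLOP/s "
                  f"{ratios[n][name]:7.2f}x baseline time")
```

Comparing GFLOP/s across sizes, rather than raw times, makes it easier to see whether a backend is limited by call overhead on small matrices or by sustained compute on large ones.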

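Joseph's point about amortizing data movement over several local iterations can be put into a back-of-the-envelope cost model. This is a sketch under stated assumptions, not a measurement: it assumes one host-to-GPU transfer serves k consecutive iterations and that per-iteration compute time is constant. All function names and the numbers in the example run are hypothetical.

```python
import math


def transfer_seconds(num_bytes, bandwidth_bytes_per_s):
    """Time to copy `num_bytes` between host and GPU at a given sustained bandwidth."""
    return num_bytes / bandwidth_bytes_per_s


def per_iteration_seconds(transfer_s, compute_s, local_iterations):
    """Average cost of one iteration when a single transfer serves `local_iterations`."""
    if local_iterations < 1:
        raise ValueError("local_iterations must be at least 1")
    return transfer_s / local_iterations + compute_s


def min_iterations_for_gpu(transfer_s, gpu_compute_s, cpu_compute_s):
    """Smallest k with transfer_s / k + gpu_compute_s < cpu_compute_s.

    Returns None when the GPU compute step is not faster than the CPU step,
    because no amount of amortization helps in that case.
    """
    if gpu_compute_s >= cpu_compute_s:
        return None
    threshold = transfer_s / (cpu_compute_s - gpu_compute_s)
    return max(1, math.floor(threshold) + 1)


if __name__ == "__main__":
    # HYPOTHETICAL numbers, chosen only to illustrate the model.
    copy = transfer_seconds(num_bytes=8e9, bandwidth_bytes_per_s=8e9)
    for k in (1, 2, 4, 8, 16):
        print(k, per_iteration_seconds(copy, compute_s=0.25, local_iterations=k))
    print(min_iterations_for_gpu(copy, gpu_compute_s=0.25, cpu_compute_s=0.5))
```

In this model the transfer term shrinks as 1/k, so the GPU only pays off once enough iterations reuse the data already resident in GPU memory.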