Yeah, much more reasonable - nice to know that we can get full GPU
performance from breeze/netlib-java - meaning there's no compelling
performance reason to switch out our current linear algebra library
(at least as far as this benchmark is concerned).
Instead, it looks like a user guide for configuring Spark/MLlib to use
the right BLAS library will get us most of the way there. Or, would it
make sense to finally ship openblas compiled for some common platforms
(64-bit linux, windows, mac) directly with Spark - hopefully
eliminating the jblas warnings once and for all for most users?
(Licensing is BSD) Or am I missing something?
On Wed, Mar 25, 2015 at 6:03 PM, Ulanov, Alexander
<alexander.ula...@hp.com> wrote:
As everyone suggested, the results were too good to be true, so I
double-checked them. It turns out that nvblas did not perform the
multiplication because of the NVBLAS_TILE_DIM parameter in
"nvblas.conf" and returned a zero matrix. My previously posted nvblas
results therefore measure matrix copying only. The default
NVBLAS_TILE_DIM==2048 is too big for my graphics card/matrix size. I
handpicked other values that worked. As a result, netlib+nvblas is on
par with BIDMat-cuda. As promised, I am going to post a how-to for
nvblas configuration.
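A silently returned zero matrix, as described above, is easy to miss in
a pure timing benchmark. The following is a minimal sketch of a sanity
check that could run next to the timing code: it compares a result
matrix against a slow reference product. The function names
(naive_gemm, gemm_result_is_plausible) and the tolerance are
illustrative assumptions, not part of nvblas, netlib-java or the
benchmark from this thread.

```python
def naive_gemm(a, b):
    """Reference product of two matrices given as lists of rows (slow, pure Python)."""
    rows, inner, cols = len(a), len(b), len(b[0])
    return [[sum(a[i][p] * b[p][j] for p in range(inner)) for j in range(cols)]
            for i in range(rows)]


def gemm_result_is_plausible(a, b, c, tol=1e-9):
    """Return True if c matches the reference product of a and b.

    tol is a hypothetical relative tolerance; pick it to suit the
    precision (Float or Double) of the library under test.
    """
    ref = naive_gemm(a, b)
    return all(abs(got - want) <= tol * max(1.0, abs(want))
               for got_row, want_row in zip(c, ref)
               for got, want in zip(got_row, want_row))
```

For large matrices the reference product is too slow to compute in
full, so checking a small sub-block or a few sampled entries is the
more practical variant of the same idea.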
https://docs.google.com/spreadsheets/d/1lWdVSuSragOobb0A_oeouQgHUMx378T9J5r7kwKSPkY/edit?usp=sharing
-----Original Message-----
From: Ulanov, Alexander
Sent: Wednesday, March 25, 2015 2:31 PM
To: Sam Halliday
Cc: dev@spark.apache.org; Xiangrui Meng; Joseph Bradley;
Evan R. Sparks; jfcanny
Subject: RE: Using CUDA within Spark / boosting linear algebra
Hi again,
I finally managed to use nvblas within Spark+netlib-java. It has
exceptional performance for big matrices with Double, faster than
BIDMat-cuda with Float. But for smaller matrices, if you have to copy
them to/from the GPU, OpenBLAS or MKL might be a better choice. This
agrees with the original nvblas presentation at GPU conf 2013
(slide 21):
http://on-demand.gputechconf.com/supercomputing/2013/presentation/SC3108-New-Features-CUDA%206%20-GPU-Acceleration.pdf
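The small-versus-big matrix effect mentioned above can be illustrated
with a back-of-the-envelope calculation: multiplying two n x n matrices
costs about 2*n^3 floating point operations, while only 3*n^2 elements
cross the bus (A and B to the GPU, C back), so the work per transferred
byte grows linearly with n. This is a rough sketch; the throughput and
bandwidth arguments are hypothetical placeholders, not measurements
from my machine.

```python
def gemm_flops(n):
    """Approximate floating point operations for an n x n by n x n multiply."""
    return 2 * n ** 3


def gemm_transfer_bytes(n, bytes_per_element=8):
    """Bytes moved if A and B are copied to the GPU and C is copied back (Double = 8 bytes)."""
    return 3 * n * n * bytes_per_element


def flops_per_byte(n, bytes_per_element=8):
    """Arithmetic intensity of the multiply relative to the data copied."""
    return gemm_flops(n) / gemm_transfer_bytes(n, bytes_per_element)


def estimated_seconds(n, gflops, gb_per_s):
    """Return (compute_time, transfer_time) for assumed device throughput and bus bandwidth."""
    compute = gemm_flops(n) / (gflops * 1e9)
    transfer = gemm_transfer_bytes(n) / (gb_per_s * 1e9)
    return compute, transfer


if __name__ == "__main__":
    # gflops and gb_per_s below are made-up example values.
    for size in (100, 1000, 10000):
        print(size, flops_per_byte(size), estimated_seconds(size, gflops=1000.0, gb_per_s=10.0))
```

With these assumptions the copy time dominates for small n and the
compute time dominates for large n, which is the qualitative behaviour
reported in the spreadsheet.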
My results:
https://docs.google.com/spreadsheets/d/1lWdVSuSragOobb0A_oeouQgHUMx378T9J5r7kwKSPkY/edit?usp=sharing
Just in case: these tests are not meant to generalize about the
performance of different libraries. I just want to pick the library
that is best at dense matrix multiplication for my task.
P.S. My previous issue with nvblas was the following: it exposes the
Fortran blas functions, whereas netlib-java calls the C cblas
functions. So one needs a cblas shared library to use nvblas
through netlib-java. Fedora does not ship cblas (Debian and
Ubuntu do), so I had to compile it. I could not use cblas
from Atlas or OpenBLAS because they link to their own implementation
and not to the Fortran blas.
Best regards, Alexander
-----Original Message-----
From: Ulanov, Alexander
Sent: Tuesday, March 24, 2015 6:57 PM
To: Sam Halliday
Cc: dev@spark.apache.org; Xiangrui Meng; Joseph Bradley;
Evan R. Sparks
Subject: RE: Using CUDA within Spark / boosting linear algebra
Hi,
I am trying to use nvblas with netlib-java from Spark. nvblas
functions should replace current blas functions calls after
executing LD_PRELOAD as suggested in
http://docs.nvidia.com/cuda/nvblas/#Usage without any changes to
netlib-java. It seems to work for a simple Java example, but I
cannot make it work with Spark. I run the following:
export LD_LIBRARY_PATH=/usr/local/cuda-6.5/lib64
env LD_PRELOAD=/usr/local/cuda-6.5/lib64/libnvblas.so ./spark-shell --driver-memory 4G
In nvidia-smi I observe that Java is using the GPU:
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 8873 C bash 39MiB |
| 0 8910 C /usr/lib/jvm/java-1.7.0/bin/java 39MiB |
+-----------------------------------------------------------------------------+
In Spark shell I do matrix multiplication and see the following:
15/03/25 06:48:01 INFO JniLoader: successfully loaded
/tmp/jniloader8192964377009965483netlib-native_system-linux-x86_64.so
So I am sure that netlib-native is loaded and cblas is supposedly
used. However, the matrix multiplication executes on the CPU, since I
see 16% CPU usage and 0% GPU usage. I also checked different
matrix sizes, from 100x100 to 12000x12000.
Could you suggest why LD_PRELOAD might not affect the Spark shell?
Best regards, Alexander
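One way to narrow down a question like the one above is to look at
which shared libraries are actually mapped into the running JVM. On
Linux this can be read from /proc/<pid>/maps. The sketch below parses
that file; the helper names are illustrative assumptions, and it only
shows whether a library is mapped into the process, not whether the
BLAS symbols really resolve to it.

```python
import os


def loaded_libraries(maps_text):
    """Return the sorted shared-library paths found in /proc/<pid>/maps content."""
    libs = set()
    for line in maps_text.splitlines():
        # Fields: address, perms, offset, dev, inode, then an optional pathname.
        fields = line.strip().split(None, 5)
        if len(fields) < 6:
            continue  # anonymous mapping without a pathname
        path = fields[5].strip()
        if ".so" in os.path.basename(path):
            libs.add(path)
    return sorted(libs)


def is_loaded(maps_text, name_fragment):
    """True if any mapped library file name contains name_fragment, e.g. 'libnvblas'."""
    return any(name_fragment in os.path.basename(path)
               for path in loaded_libraries(maps_text))


def maps_of_pid(pid):
    """Read the memory map of a running process (Linux only)."""
    with open("/proc/%d/maps" % pid) as handle:
        return handle.read()
```

For example, is_loaded(maps_of_pid(8910), "libnvblas") would check the
Java process from the nvidia-smi output above, assuming it is still
running under that PID.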
From: Sam Halliday [mailto:sam.halli...@gmail.com]
Sent: Monday, March 09, 2015 6:01 PM
To: Ulanov, Alexander
Cc: dev@spark.apache.org; Xiangrui Meng; Joseph Bradley;
Evan R. Sparks
Subject: RE: Using CUDA within Spark / boosting linear algebra
Thanks so much for following up on this!
Hmm, I wonder if we should have a concerted effort to chart
performance on various pieces of hardware...
On 9 Mar 2015 21:08, "Ulanov, Alexander" <alexander.ula...@hp.com>
wrote:
Hi Everyone, I've updated the benchmark as Xiangrui suggested.
I added a comment that BIDMat 0.9.7 uses Float matrices on the GPU
(although I see support for Double in the current source code)
and ran the test with BIDMat and CPU Double matrices. BIDMat MKL is
indeed on par with netlib MKL.
https://docs.google.com/spreadsheets/d/1lWdVSuSragOobb0A_oeouQgHUMx378T9J5r7kwKSPkY/edit?usp=sharing
Best regards, Alexander
-----Original Message-----
From: Sam Halliday [mailto:sam.halli...@gmail.com]
Sent: Tuesday, March 03, 2015 1:54 PM
To: Xiangrui Meng; Joseph Bradley
Cc: Evan R. Sparks; Ulanov, Alexander; dev@spark.apache.org
Subject: Re: Using CUDA within Spark / boosting linear algebra
BTW, is anybody on this list going to the London Meetup in a few
weeks?
https://skillsmatter.com/meetups/6987-apache-spark-living-the-post-mapreduce-world#community
Would be nice to meet other people working on the guts of Spark! :-)
Xiangrui Meng <men...@gmail.com> writes:
> Hey Alexander,
>
> I don't quite understand the part where netlib-cublas is about 20x
> slower than netlib-openblas. What is the overhead of using a GPU BLAS
> with netlib-java?
>
> CC'ed Sam, the author of netlib-java.
>
> Best,
> Xiangrui
>
> On Wed, Feb 25, 2015 at 3:36 PM, Joseph Bradley
> <jos...@databricks.com> wrote:
>> Better documentation for linking would be very helpful! Here's a JIRA:
>> https://issues.apache.org/jira/browse/SPARK-6019
>>
>>
>> On Wed, Feb 25, 2015 at 2:53 PM, Evan R. Sparks
>> <evan.spa...@gmail.com> wrote:
>>
>>> Thanks for compiling all the data and running these benchmarks,
>>> Alex. The big takeaways here can be seen with this chart:
>>>
>>>
>>> https://docs.google.com/spreadsheets/d/1aRm2IADRfXQV7G2vrcVh4StF50uZHl6kmAJeaZZggr0/pubchart?oid=1899767119&format=interactive
>>>
>>> 1) A properly configured GPU matrix multiply implementation (e.g.
>>> BIDMat+GPU) can provide substantial (but less than an order of
>>> magnitude) benefit over a well-tuned CPU implementation (e.g.
>>> BIDMat+MKL or netlib-java+openblas-compiled).
>>> 2) A poorly tuned CPU implementation (netlib-f2jblas or netlib-ref)
>>> can be 1-2 orders of magnitude worse than a well-tuned CPU
>>> implementation, particularly for larger matrices. This is not to
>>> pick on netlib - this basically agrees with the author's own
>>> benchmarks (https://github.com/fommil/netlib-java)
>>>
>>> I think that most of our users are in a situation where using GPUs
>>> may not be practical - although we could consider having a good GPU
>>> backend available as an option. However, *ALL* users of MLlib could
>>> benefit (potentially tremendously) from using a well-tuned CPU-based
>>> BLAS implementation. Perhaps we should consider updating the mllib
>>> guide with a more complete section for enabling high performance
>>> binaries on OSX and Linux? Or better, figure out a way for the
>>> system to fetch these automatically.
>>>
>>> - Evan
>>>
>>>
>>>
>>> On Thu, Feb 12, 2015 at 4:18 PM, Ulanov, Alexander <
>>> alexander.ula...@hp.com> wrote:
>>>
>>>> Just to summarize this thread, I was finally able to make all
>>>> performance comparisons that we discussed. It turns out that:
>>>> BIDMat-cublas>>BIDMat MKL==netlib-mkl==netlib-openblas-compiled>netlib-openblas-yum-repo==netlib-cublas>netlib-blas>f2jblas
>>>>
>>>> Below is the link to the spreadsheet with full results.
>>>>
>>>>
>>>> https://docs.google.com/spreadsheets/d/1lWdVSuSragOobb0A_oeouQgHUMx378T9J5r7kwKSPkY/edit?usp=sharing
>>>>
>>>> One thing still needs exploration: does BIDMat-cublas perform
>>>> copying to/from machine’s RAM?
>>>>
>>>> -----Original Message-----
>>>> From: Ulanov, Alexander
>>>> Sent: Tuesday, February 10, 2015 2:12 PM
>>>> To: Evan R. Sparks
>>>> Cc: Joseph Bradley; dev@spark.apache.org
>>>> Subject: RE: Using CUDA within Spark / boosting linear algebra
>>>>
>>>> Thanks, Evan! It seems that ticket was marked as duplicate though
>>>> the original one discusses a slightly different topic. I was able to
>>>> link netlib with MKL from BIDMat binaries. Indeed, MKL is
>>>> statically linked inside a 60MB library.
>>>>
>>>> |A*B size                | BIDMat MKL  | Breeze+Netlib-MKL from BIDMat | Breeze+Netlib-OpenBlas(native system) | Breeze+Netlib-f2jblas |
>>>> +------------------------+-------------+-------------------------------+---------------------------------------+-----------------------+
>>>> |100x100*100x100         | 0,00205596  | 0,000381                      | 0,03810324                            | 0,002556              |
>>>> |1000x1000*1000x1000     | 0,018320947 | 0,038316857                   | 0,51803557                            | 1,638475459           |
>>>> |10000x10000*10000x10000 | 23,78046632 | 32,94546697                   | 445,0935211                           | 1569,233228           |
>>>>
>>>> It turns out that pre-compiled MKL is faster than pre-compiled
>>>> OpenBLAS on my machine. I'll probably add two more columns with
>>>> locally compiled openblas and cuda.
>>>>
>>>> Alexander
>>>>
>>>> From: Evan R. Sparks [mailto:evan.spa...@gmail.com]
>>>> Sent: Monday, February 09, 2015 6:06 PM
>>>> To: Ulanov, Alexander
>>>> Cc: Joseph Bradley; dev@spark.apache.org
>>>> Subject: Re: Using CUDA within Spark / boosting linear algebra
>>>>
>>>> Great - perhaps we can move this discussion off-list and onto a
>>>> JIRA ticket? (Here's one:
>>>> https://issues.apache.org/jira/browse/SPARK-5705)
>>>>
>>>> It seems like this is going to be somewhat exploratory for a while
>>>> (and there's probably only a handful of us who really care about
>>>> fast linear algebra!)
>>>>
>>>> - Evan
>>>>
>>>> On Mon, Feb 9, 2015 at 4:48 PM, Ulanov, Alexander <
>>>> alexander.ula...@hp.com> wrote:
>>>> Hi Evan,
>>>>
>>>> Thank you for explanation and useful link. I am going to build
>>>> OpenBLAS, link it with Netlib-java and perform benchmark again.
>>>>
>>>> Do I understand correctly that BIDMat binaries contain statically
>>>> linked Intel MKL BLAS? It might be the reason why I am able to run
>>>> BIDMat without having MKL BLAS installed on my server. If it is true, I
>>>> wonder if it is OK because Intel sells this library. Nevertheless,
>>>> it seems that in my case precompiled MKL BLAS performs better than
>>>> precompiled OpenBLAS given that BIDMat and Netlib-java are
>>>> supposed to be on par with JNI overheads.
>>>>
>>>> Though, it might be interesting to link Netlib-java with Intel MKL,
>>>> as you suggested. I wonder whether John Canny (BIDMat) and Sam
>>>> Halliday (Netlib-java) are interested in comparing their libraries.
>>>>
>>>> Best regards, Alexander
>>>>
>>>> From: Evan R. Sparks [mailto:evan.spa...@gmail.com]
>>>> Sent: Friday, February 06, 2015 5:58 PM
>>>>
>>>> To: Ulanov, Alexander
>>>> Cc: Joseph Bradley; dev@spark.apache.org
>>>> Subject: Re: Using CUDA within Spark / boosting linear algebra
>>>>
>>>> I would build OpenBLAS yourself, since good BLAS performance comes
>>>> from getting cache sizes, etc. set up correctly for your particular
>>>> hardware - this is often a very tricky process (see, e.g. ATLAS),
>>>> but we found that on relatively modern Xeon chips, OpenBLAS builds
>>>> quickly and yields performance competitive with MKL.
>>>>
>>>> To make sure the right library is getting used, you have to make
>>>> sure it's first on the search path - export
>>>> LD_LIBRARY_PATH=/path/to/blas/library/directory will do the trick
>>>> here (the variable takes directories, not the .so file itself).
>>>>
>>>> For some examples of getting netlib-java setup on an ec2 node and
>>>> some example benchmarking code we ran a while back, see:
>>>> https://github.com/shivaram/matrix-bench
>>>>
>>>> In particular - build-openblas-ec2.sh shows you how to build the
>>>> library and set up symlinks correctly, and scala/run-netlib.sh
>>>> shows you how to get the path setup and get that library
>>>> picked up by netlib-java.
>>>>
>>>> In this way - you could probably get cuBLAS set up to be used by
>>>> netlib-java as well.
>>>>
>>>> - Evan
>>>>
>>>> On Fri, Feb 6, 2015 at 5:43 PM, Ulanov, Alexander <
>>>> alexander.ula...@hp.com> wrote:
>>>> Evan, could you elaborate on how to force BIDMat and netlib-java to
>>>> load the right blas? For netlib, there are a few JVM
>>>> flags, such as
>>>> -Dcom.github.fommil.netlib.BLAS=com.github.fommil.netlib.F2jBLAS,
>>>> so I can force it to use the Java implementation. I am not sure I
>>>> understand how to force the use of a specific blas (not a specific
>>>> wrapper for blas).
>>>>
>>>> Btw. I have installed openblas (yum install openblas), so I suppose
>>>> that netlib is using it.
>>>>
>>>> From: Evan R. Sparks [mailto:evan.spa...@gmail.com]
>>>> Sent: Friday, February 06, 2015 5:19 PM
>>>> To: Ulanov, Alexander
>>>> Cc: Joseph Bradley; dev@spark.apache.org
>>>>
>>>> Subject: Re: Using CUDA within Spark / boosting linear algebra
>>>>
>>>> Getting breeze to pick up the right blas library is critical for
>>>> performance. I recommend using OpenBLAS (or MKL, if you already
>>>> have it). It might make sense to force BIDMat to use the same
>>>> underlying BLAS library as well.
>>>>
>>>> On Fri, Feb 6, 2015 at 4:42 PM, Ulanov, Alexander <
>>>> alexander.ula...@hp.com> wrote:
>>>> Hi Evan, Joseph
>>>>
>>>> I did a few matrix multiplication tests and BIDMat seems to be ~10x
>>>> faster than netlib-java+breeze (sorry for the weird table
>>>> formatting):
>>>>
>>>> |A*B size                | BIDMat MKL  | Breeze+Netlib-java native_system_linux_x86-64 | Breeze+Netlib-java f2jblas |
>>>> +------------------------+-------------+-----------------------------------------------+----------------------------+
>>>> |100x100*100x100         | 0,00205596  | 0,03810324                                    | 0,002556                   |
>>>> |1000x1000*1000x1000     | 0,018320947 | 0,51803557                                    | 1,638475459                |
>>>> |10000x10000*10000x10000 | 23,78046632 | 445,0935211                                   | 1569,233228                |
>>>>
>>>> Configuration: Intel(R) Xeon(R) CPU E31240 3.3 GHz, 6GB RAM, Fedora
>>>> 19 Linux, Scala 2.11.
>>>>
>>>> Later I will make tests with Cuda. I need to install a new Cuda
>>>> version for this purpose.
>>>>
>>>> Do you have any ideas why breeze-netlib with native blas is so much
>>>> slower than BIDMat MKL?
>>>>
>>>> Best regards, Alexander
>>>>
>>>> From: Joseph Bradley [mailto:jos...@databricks.com]
>>>> Sent: Thursday, February 05, 2015 5:29 PM
>>>> To: Ulanov, Alexander
>>>> Cc: Evan R. Sparks; dev@spark.apache.org
>>>> Subject: Re: Using CUDA within Spark / boosting linear algebra
>>>>
>>>> Hi Alexander,
>>>>
>>>> Using GPUs with Spark would be very exciting. Small comment:
>>>> Concerning your question earlier about keeping data stored on the
>>>> GPU rather than having to move it between main memory and GPU
>>>> memory on each iteration, I would guess this would be critical to
>>>> getting good performance. If you could do multiple local
>>>> iterations before aggregating results, then the cost of data
>>>> movement to the GPU could be amortized (and I believe that is done
>>>> in practice). Having Spark be aware of the GPU and using it
>>>> as another part of memory sounds like a much bigger undertaking.
>>>>
>>>> Joseph
>>>>
>>>> On Thu, Feb 5, 2015 at 4:59 PM, Ulanov, Alexander <
>>>> alexander.ula...@hp.com> wrote:
>>>> Thank you for the explanation! I’ve watched the BIDMach presentation by
>>>> John Canny and I am really inspired by his talk and his
>>>> comparisons with Spark MLlib.
>>>>
>>>> I am very interested to find out what will be better within Spark:
>>>> BIDMat or netlib-java with CPU or GPU natives. Could you suggest a
>>>> fair way to benchmark them? Currently I do benchmarks on artificial
>>>> neural networks in batch mode. While it is not a “pure” test of
>>>> linear algebra, it involves some other things that are
>>>> essential to machine learning.
>>>>
>>>> From: Evan R. Sparks [mailto:evan.spa...@gmail.com]
>>>> Sent: Thursday, February 05, 2015 1:29 PM
>>>> To: Ulanov, Alexander
>>>> Cc: dev@spark.apache.org
>>>> Subject: Re: Using CUDA within Spark / boosting linear algebra
>>>>
>>>> I'd be surprised if BIDMat+OpenBLAS was significantly faster than
>>>> netlib-java+OpenBLAS, but if it is much faster it's probably due to
>>>> data layout and fewer levels of indirection - it's definitely a
>>>> worthwhile experiment to run. The main speedups I've seen from
>>>> using it come from highly optimized GPU code for linear algebra. I
>>>> know that in the past Canny has gone as far as to write custom GPU
>>>> kernels for performance-critical regions of code.[1]
>>>>
>>>> BIDMach is highly optimized for single node performance or
>>>> performance on small clusters.[2] Once data doesn't fit easily in
>>>> GPU memory (or can be batched in that way) the performance tends to
>>>> fall off. Canny argues for hardware/software codesign and as such
>>>> prefers machine configurations that are quite different than what
>>>> we find in most commodity cluster nodes - e.g. 10 disk channels
>>>> and 4 GPUs.
>>>>
>>>> In contrast, MLlib was designed for horizontal scalability on
>>>> commodity clusters and works best on very big datasets - on the
>>>> order of terabytes.
>>>>
>>>> For the most part, these projects developed concurrently to address
>>>> slightly different use cases. That said, there may be bits of
>>>> BIDMach we could repurpose for MLlib - keep in mind we need to be
>>>> careful about maintaining cross-language compatibility for our Java
>>>> and Python users, though.
>>>>
>>>> - Evan
>>>>
>>>> [1] - http://arxiv.org/abs/1409.5402
>>>> [2] - http://eecs.berkeley.edu/~hzhao/papers/BD.pdf
>>>>
>>>> On Thu, Feb 5, 2015 at 1:00 PM, Ulanov, Alexander <
>>>> alexander.ula...@hp.com> wrote:
>>>> Hi Evan,
>>>>
>>>> Thank you for the suggestion! BIDMat seems to have terrific speed. Do
>>>> you know what makes it faster than netlib-java?
>>>>
>>>> The same group has the BIDMach library that implements machine
>>>> learning. For some examples they use the Caffe convolutional neural
>>>> network library owned by another group in Berkeley. Could you
>>>> elaborate on how these all might be connected with Spark MLlib? If
>>>> you take BIDMat for linear algebra, why don’t you take BIDMach
>>>> for optimization and learning?
>>>>
>>>> Best regards, Alexander
>>>>
>>>> From: Evan R. Sparks [mailto:evan.spa...@gmail.com]
>>>> Sent: Thursday, February 05, 2015 12:09 PM
>>>> To: Ulanov, Alexander
>>>> Cc: dev@spark.apache.org
>>>> Subject: Re: Using CUDA within Spark / boosting linear algebra
>>>>
>>>> I'd expect that we can make GPU-accelerated BLAS faster than CPU
>>>> blas in many cases.
>>>>
>>>> You might consider taking a look at the codepaths that BIDMat (
>>>> https://github.com/BIDData/BIDMat) takes and comparing them to
>>>> netlib-java/breeze. John Canny et al. have done a bunch of work
>>>> optimizing to make this work really fast from Scala. I've run it on
>>>> my laptop and compared to MKL and in certain cases it's 10x faster
>>>> at matrix multiply.
>>>> There are a lot of layers of indirection here and you really want
>>>> to avoid data copying as much as possible.
>>>>
>>>> We could also consider swapping out BIDMat for Breeze, but that
>>>> would be a big project and if we can figure out how to get
>>>> breeze+cublas to comparable performance that would be a big win.
>>>>
>>>> On Thu, Feb 5, 2015 at 11:55 AM, Ulanov, Alexander <
>>>> alexander.ula...@hp.com> wrote:
>>>> Dear Spark developers,
>>>>
>>>> I am exploring how to make linear algebra operations faster within
>>>> Spark. One way of doing this is to use the Scala Breeze library that
>>>> is bundled with Spark. For matrix operations, it employs Netlib-java,
>>>> which has a Java wrapper for BLAS (basic linear algebra subprograms)
>>>> and LAPACK native binaries if they are available on the worker
>>>> node. It also has its own optimized Java implementation of BLAS. It
>>>> is worth mentioning that native binaries provide better performance
>>>> only for BLAS level 3, i.e. matrix-matrix operations or general
>>>> matrix multiplication (GEMM).
>>>> This is confirmed by the GEMM test on the Netlib-java page
>>>> https://github.com/fommil/netlib-java. I also confirmed it with my
>>>> experiments with training of an artificial neural network
>>>> https://github.com/apache/spark/pull/1290#issuecomment-70313952.
>>>> However, I would like to boost performance more.
>>>>
>>>> GPU is supposed to work fast with linear algebra and there is an
>>>> Nvidia CUDA implementation of BLAS, called cublas. I have one Linux
>>>> server with an Nvidia GPU and I was able to do the following. I linked
>>>> cublas (instead of cpu-based blas) with the Netlib-java wrapper and put
>>>> it into Spark, so Breeze/Netlib is using it. Then I did some
>>>> performance measurements with regard to artificial neural network
>>>> batch learning in Spark MLlib that involves matrix-matrix
>>>> multiplications. It turns out that for matrices of size less than
>>>> ~1000x780 GPU cublas has the same speed as CPU blas. Cublas becomes
>>>> slower for bigger matrices. It is worth mentioning that it was not a
>>>> test for ONLY multiplication since there are other operations involved.
>>>> One of the reasons for the slowdown might be the overhead of copying
>>>> the matrices from computer memory to graphics card memory and back.
>>>>
>>>> So, few questions:
>>>> 1) Do these results with CUDA make sense?
>>>> 2) If the problem is with copy overhead, are there any libraries
>>>> that allow to force intermediate results to stay in graphic card
>>>> memory thus removing the overhead?
>>>> 3) Any other options to speed-up linear algebra in Spark?
>>>>
>>>> Thank you, Alexander
>>>>
>>>>
>>>> ---------------------------------------------------------------------
>>>> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
>>>> For additional commands, e-mail: dev-h...@spark.apache.org
>>>>
>>>>
>>>>
>>>>
>>>
--
Best regards,
Sam