The copying overhead should be quadratic in n, while the computation
cost is cubic in n. I can understand that netlib-cublas is slower than
netlib-openblas on small problems. But I'm surprised to see that it is
still 20x slower on 10000x10000. I did the following on a g2.2xlarge
instance with BIDMat:

val n = 10000

val f = rand(n, n)
// CPU only (flip starts BIDMat's timer/flop counter, flop reads it back)
flip; f*f; val rf = flop

// CPU->GPU->CPU: allocate on the GPU, copy over, multiply, copy the result back
flip; val g = GMat(n, n); g.copyFrom(f); (g*g).toFMat(null); val rg = flop

// GPU only: operands already resident in GPU memory
flip; g*g; val rgg = flop

The CPU version finished in 12 seconds.
The CPU->GPU->CPU version finished in 2.2 seconds.
The GPU version finished in 1.7 seconds.

I'm not sure whether my CPU->GPU->CPU code simulates the netlib-cublas
path. But based on these results, the data copying overhead is definitely
nowhere near a factor of 20x at n = 10000.
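
For completeness, here is the back-of-envelope arithmetic behind that claim
(a sketch, not a measurement; the ~3 GB/s PCIe rate and ~100 Gflop/s CPU rate
are assumptions I am plugging in, not numbers from the benchmark):

val n = 10000L
val bytesPerElem = 4L                       // single precision, as in FMat/GMat
val copyBytes = 2L * n * n * bytesPerElem   // input over, result back: ~0.8 GB
val copySeconds = copyBytes / 3e9           // ~0.27 s at an assumed 3 GB/s
val gemmFlops = 2.0 * n * n * n             // ~2e12 flops for the multiply
val gemmSeconds = gemmFlops / 100e9         // ~20 s at an assumed 100 Gflop/s CPU

Even with pessimistic transfer assumptions, the copies cost a fraction of a
second against the ~12 seconds the CPU multiply actually took.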

Best,
Xiangrui


On Thu, Feb 26, 2015 at 2:21 PM, Sam Halliday <sam.halli...@gmail.com> wrote:
> I've had some email exchanges with the author of BIDMat: it does exactly
> what you need to get the GPU benefit and implements higher-level algorithms
> entirely in GPU kernels so that the data stays on the GPU as long as
> possible. The restriction of this approach is that it only offers
> high-level algorithms, so it is not a toolkit for applied mathematics
> research and development --- but it works well as a toolkit for higher-level
> analysis (e.g. for analysts and practitioners).
>
> I believe BIDMat's approach is the best way to get performance out of
> GPU hardware at the moment, but I also have strong evidence to suggest
> that the hardware will catch up and the memory transfer costs between
> CPU and GPU will disappear, meaning that there will be no need for custom
> GPU kernel implementations. I.e., please continue to use BLAS primitives
> when writing new algorithms and only go to the GPU for an alternative
> optimised implementation.
>
> Note that CUDA and cuBLAS are *not* BLAS. They are BLAS-like, and offer
> an API that looks like BLAS but takes pointers to buffers in GPU memory.
> Somebody has written a wrapper around CUDA to create a proper BLAS library,
> but it only gives marginal gains over the CPU because of the memory
> transfer overhead.
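>
> To make the point concrete, here is roughly what every call looks like from
> the JVM side when you go through netlib-java (a minimal sketch; the size and
> values are arbitrary). The arrays live in ordinary JVM heap memory, so any
> GPU-backed implementation hiding behind this interface has to copy every
> operand to the device and the result back on every single call:
>
>   import com.github.fommil.netlib.BLAS
>
>   val n = 1000
>   val a = Array.fill(n * n)(math.random)  // column-major, in host (JVM heap) memory
>   val b = Array.fill(n * n)(math.random)
>   val c = new Array[Double](n * n)
>
>   // C := 1.0 * A * B + 0.0 * C
>   BLAS.getInstance().dgemm("N", "N", n, n, n, 1.0, a, n, b, n, 0.0, c, n)
>
> The interface only ever speaks double[] in host memory; cuBLAS proper speaks
> device pointers, which is exactly why it is BLAS-like rather than BLAS.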
>
> This slide from my talk
>
>   http://fommil.github.io/scalax14/#/11/2
>
> says it all. The X axis is matrix size, the Y axis is time to do DGEMM on a
> log scale. The black line is the "cheating" time for the GPU and the green
> line is the time after copying the memory to/from the GPU. APUs have the
> potential to eliminate the green line.
>
> Best regards,
> Sam
>
>
>
> "Ulanov, Alexander" <alexander.ula...@hp.com> writes:
>
>> Evan, thank you for the summary. I would like to add some more observations.
>> The GPU that I used is 2.5 times cheaper than the CPU ($100 for the GPU vs.
>> $250 for the CPU). They are both 3 years old. I also did a small test with
>> modern hardware, and the new Nvidia Titan GPU was slightly more than an
>> order of magnitude faster than an Intel E5-2650 v2 on the same tests.
>> However, it costs as much as the CPU ($1200). My takeaway is that GPUs are
>> making better price/performance progress.
>>
>>
>>
>> Xiangrui, I was also surprised that BIDMat-cuda was faster than netlib-cuda,
>> and the most reasonable explanation is that it holds the result in GPU
>> memory, as Sam suggested. At the same time, that is OK, because you can copy
>> the result back from the GPU only when needed. However, to be sure, I am
>> going to ask the developer of BIDMat at his upcoming talk.
>>
>>
>>
>> Best regards, Alexander
>>
>>
>> From: Sam Halliday [mailto:sam.halli...@gmail.com]
>> Sent: Thursday, February 26, 2015 1:56 PM
>> To: Xiangrui Meng
>> Cc: dev@spark.apache.org; Joseph Bradley; Ulanov, Alexander; Evan R. Sparks
>> Subject: Re: Using CUDA within Spark / boosting linear algebra
>>
>>
>> Btw, I wish people would stop cheating when comparing CPU and GPU timings 
>> for things like matrix multiply :-P
>>
>> Please always compare apples with apples and include the time it takes to
>> set up the matrices, send them to the processing unit, do the calculation,
>> AND copy the results back to where you need to see them.
>>
>> Ignoring these costs will make you believe that your GPU is thousands of
>> times faster than it really is. Again, jump to the end of my talk for graphs
>> and more discussion... especially the bit about me being keen on funding
>> to investigate APU hardware further ;-) (I believe it will solve the problem)
>> On 26 Feb 2015 21:16, "Xiangrui Meng" <men...@gmail.com> wrote:
>> Hey Alexander,
>>
>> I don't quite understand the part where netlib-cublas is about 20x
>> slower than netlib-openblas. What is the overhead of using a GPU BLAS
>> with netlib-java?
>>
>> CC'ed Sam, the author of netlib-java.
>>
>> Best,
>> Xiangrui
>>
>> On Wed, Feb 25, 2015 at 3:36 PM, Joseph Bradley <jos...@databricks.com> wrote:
>>> Better documentation for linking would be very helpful!  Here's a JIRA:
>>> https://issues.apache.org/jira/browse/SPARK-6019
>>>
>>>
>>> On Wed, Feb 25, 2015 at 2:53 PM, Evan R. Sparks <evan.spa...@gmail.com> wrote:
>>>
>>>> Thanks for compiling all the data and running these benchmarks, Alex. The
>>>> big takeaways here can be seen with this chart:
>>>>
>>>> https://docs.google.com/spreadsheets/d/1aRm2IADRfXQV7G2vrcVh4StF50uZHl6kmAJeaZZggr0/pubchart?oid=1899767119&format=interactive
>>>>
>>>> 1) A properly configured GPU matrix multiply implementation (e.g.
>>>> BIDMat+GPU) can provide substantial (but less than an order of magnitude)
>>>> benefit over a well-tuned CPU implementation (e.g. BIDMat+MKL or
>>>> netlib-java+openblas-compiled).
>>>> 2) A poorly tuned CPU implementation can be 1-2 orders of magnitude worse
>>>> than a well-tuned CPU implementation, particularly for larger matrices.
>>>> (netlib-f2jblas or netlib-ref). This is not to pick on netlib - it basically
>>>> agrees with the author's own benchmarks (https://github.com/fommil/netlib-java).
>>>>
>>>> I think that most of our users are in a situation where using GPUs may not
>>>> be practical - although we could consider having a good GPU backend
>>>> available as an option. However, *ALL* users of MLlib could benefit
>>>> (potentially tremendously) from using a well-tuned CPU-based BLAS
>>>> implementation. Perhaps we should consider updating the mllib guide with a
>>>> more complete section for enabling high performance binaries on OSX and
>>>> Linux? Or better, figure out a way for the system to fetch these
>>>> automatically.
>>>>
>>>> - Evan
>>>>
>>>>
>>>>
>>>> On Thu, Feb 12, 2015 at 4:18 PM, Ulanov, Alexander <alexander.ula...@hp.com> wrote:
>>>>
>>>>> Just to summarize this thread, I was finally able to run all the performance
>>>>> comparisons that we discussed. It turns out that:
>>>>> BIDMat-cublas >> BIDMat MKL == netlib-mkl == netlib-openblas-compiled >
>>>>> netlib-openblas-yum-repo == netlib-cublas > netlib-blas > f2jblas
>>>>>
>>>>> Below is the link to the spreadsheet with full results.
>>>>>
>>>>> https://docs.google.com/spreadsheets/d/1lWdVSuSragOobb0A_oeouQgHUMx378T9J5r7kwKSPkY/edit?usp=sharing
>>>>>
>>>>> One thing still needs exploration: does BIDMat-cublas perform copying
>>>>> to/from machine’s RAM?
>>>>>
>>>>> -----Original Message-----
>>>>> From: Ulanov, Alexander
>>>>> Sent: Tuesday, February 10, 2015 2:12 PM
>>>>> To: Evan R. Sparks
>>>>> Cc: Joseph Bradley; dev@spark.apache.org
>>>>> Subject: RE: Using CUDA within Spark / boosting linear algebra
>>>>>
>>>>> Thanks, Evan! It seems that the ticket was marked as a duplicate, though the
>>>>> original one discusses a slightly different topic. I was able to link netlib
>>>>> with MKL from the BIDMat binaries. Indeed, MKL is statically linked inside a
>>>>> 60MB library.
>>>>>
>>>>> |A*B size                | BIDMat MKL  | Breeze+Netlib-MKL from BIDMat | Breeze+Netlib-OpenBlas (native system) | Breeze+Netlib-f2jblas |
>>>>> +------------------------+-------------+-------------------------------+----------------------------------------+-----------------------+
>>>>> |100x100*100x100         | 0.00205596  | 0.000381                      | 0.03810324                             | 0.002556              |
>>>>> |1000x1000*1000x1000     | 0.018320947 | 0.038316857                   | 0.51803557                             | 1.638475459           |
>>>>> |10000x10000*10000x10000 | 23.78046632 | 32.94546697                   | 445.0935211                            | 1569.233228           |
>>>>>
>>>>> It turns out that pre-compiled MKL is faster than pre-compiled OpenBLAS on
>>>>> my machine. I will probably add two more columns with locally compiled
>>>>> OpenBLAS and CUDA.
>>>>>
>>>>> Alexander
>>>>>
>>>>> From: Evan R. Sparks [mailto:evan.spa...@gmail.com]
>>>>> Sent: Monday, February 09, 2015 6:06 PM
>>>>> To: Ulanov, Alexander
>>>>> Cc: Joseph Bradley; dev@spark.apache.org
>>>>> Subject: Re: Using CUDA within Spark / boosting linear algebra
>>>>>
>>>>> Great - perhaps we can move this discussion off-list and onto a JIRA
>>>>> ticket? (Here's one: https://issues.apache.org/jira/browse/SPARK-5705)
>>>>>
>>>>> It seems like this is going to be somewhat exploratory for a while (and
>>>>> there's probably only a handful of us who really care about fast linear
>>>>> algebra!)
>>>>>
>>>>> - Evan
>>>>>
>>>>> On Mon, Feb 9, 2015 at 4:48 PM, Ulanov, Alexander <alexander.ula...@hp.com> wrote:
>>>>> Hi Evan,
>>>>>
>>>>> Thank you for the explanation and the useful link. I am going to build
>>>>> OpenBLAS, link it with Netlib-java, and run the benchmark again.
>>>>>
>>>>> Do I understand correctly that the BIDMat binaries contain statically linked
>>>>> Intel MKL BLAS? That might be the reason why I am able to run BIDMat without
>>>>> having MKL BLAS installed on my server. If so, I wonder whether that is OK,
>>>>> because Intel sells this library. Nevertheless, it seems that in my case
>>>>> precompiled MKL BLAS performs better than precompiled OpenBLAS, given that
>>>>> BIDMat and Netlib-java are supposed to have comparable JNI overheads.
>>>>>
>>>>> Still, it might be interesting to link Netlib-java with Intel MKL, as
>>>>> you suggested. I wonder whether John Canny (BIDMat) and Sam Halliday
>>>>> (Netlib-java) would be interested in comparing their libraries.
>>>>>
>>>>> Best regards, Alexander
>>>>>
>>>>> From: Evan R. Sparks [mailto:evan.spa...@gmail.com]
>>>>> Sent: Friday, February 06, 2015 5:58 PM
>>>>>
>>>>> To: Ulanov, Alexander
>>>>> Cc: Joseph Bradley; dev@spark.apache.org
>>>>> Subject: Re: Using CUDA within Spark / boosting linear algebra
>>>>>
>>>>> I would build OpenBLAS yourself, since good BLAS performance comes from
>>>>> getting cache sizes, etc. set up correctly for your particular hardware -
>>>>> this is often a very tricky process (see, e.g. ATLAS), but we found that on
>>>>> relatively modern Xeon chips, OpenBLAS builds quickly and yields
>>>>> performance competitive with MKL.
>>>>>
>>>>> To make sure the right library is getting used, you have to make sure
>>>>> it's first on the search path - export LD_LIBRARY_PATH=/path/to/blas (the
>>>>> directory containing the .so) will do the trick here.
>>>>>
>>>>> For some examples of getting netlib-java setup on an ec2 node and some
>>>>> example benchmarking code we ran a while back, see:
>>>>> https://github.com/shivaram/matrix-bench
>>>>>
>>>>> In particular - build-openblas-ec2.sh shows you how to build the library
>>>>> and set up symlinks correctly, and scala/run-netlib.sh shows you how to get
>>>>> the path set up and get that library picked up by netlib-java.
>>>>>
>>>>> In this way - you could probably get cuBLAS set up to be used by
>>>>> netlib-java as well.
>>>>>
>>>>> - Evan
>>>>>
>>>>> On Fri, Feb 6, 2015 at 5:43 PM, Ulanov, Alexander <alexander.ula...@hp.com> wrote:
>>>>> Evan, could you elaborate on how to force BIDMat and netlib-java to load
>>>>> the right BLAS? For netlib, there are a few JVM flags, such as
>>>>> -Dcom.github.fommil.netlib.BLAS=com.github.fommil.netlib.F2jBLAS, so I can
>>>>> force it to use the Java implementation. I am not sure I understand how to
>>>>> force the use of a specific BLAS (as opposed to a specific wrapper for BLAS).
>>>>>
>>>>> Btw, I have installed openblas (yum install openblas), so I suppose that
>>>>> netlib is using it.
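>>>>>
>>>>> One way I could check which implementation netlib-java actually picked up
>>>>> (just a sketch using netlib-java's standard entry point) would be to print
>>>>> the class of the BLAS instance; it won't say which native libblas sits
>>>>> behind NativeSystemBLAS, but it would at least show whether I silently fell
>>>>> back to the pure-Java F2jBLAS:
>>>>>
>>>>> import com.github.fommil.netlib.BLAS
>>>>>
>>>>> // prints e.g. com.github.fommil.netlib.NativeSystemBLAS or ...F2jBLAS
>>>>> println(BLAS.getInstance().getClass.getName)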
>>>>>
>>>>> From: Evan R. Sparks [mailto:evan.spa...@gmail.com]
>>>>> Sent: Friday, February 06, 2015 5:19 PM
>>>>> To: Ulanov, Alexander
>>>>> Cc: Joseph Bradley; dev@spark.apache.org
>>>>>
>>>>> Subject: Re: Using CUDA within Spark / boosting linear algebra
>>>>>
>>>>> Getting breeze to pick up the right blas library is critical for
>>>>> performance. I recommend using OpenBLAS (or MKL, if you already have it).
>>>>> It might make sense to force BIDMat to use the same underlying BLAS 
>>>>> library
>>>>> as well.
>>>>>
>>>>> On Fri, Feb 6, 2015 at 4:42 PM, Ulanov, Alexander <alexander.ula...@hp.com> wrote:
>>>>> Hi Evan, Joseph
>>>>>
>>>>> I did a few matrix multiplication tests and BIDMat seems to be ~10x faster
>>>>> than netlib-java+breeze (sorry for the weird table formatting):
>>>>>
>>>>> |A*B size                | BIDMat MKL  | Breeze+Netlib-java native_system_linux_x86-64 | Breeze+Netlib-java f2jblas |
>>>>> +------------------------+-------------+-----------------------------------------------+----------------------------+
>>>>> |100x100*100x100         | 0.00205596  | 0.03810324                                    | 0.002556                   |
>>>>> |1000x1000*1000x1000     | 0.018320947 | 0.51803557                                    | 1.638475459                |
>>>>> |10000x10000*10000x10000 | 23.78046632 | 445.0935211                                   | 1569.233228                |
>>>>>
>>>>> Configuration: Intel(R) Xeon(R) CPU E31240 3.3 GHz, 6GB RAM, Fedora 19
>>>>> Linux, Scala 2.11.
>>>>>
>>>>> Later I will run tests with CUDA. I need to install a new CUDA version for
>>>>> this purpose.
>>>>>
>>>>> Do you have any ideas why breeze-netlib with native blas is so much
>>>>> slower than BIDMat MKL?
>>>>>
>>>>> Best regards, Alexander
>>>>>
>>>>> From: Joseph Bradley [mailto:jos...@databricks.com]
>>>>> Sent: Thursday, February 05, 2015 5:29 PM
>>>>> To: Ulanov, Alexander
>>>>> Cc: Evan R. Sparks; dev@spark.apache.org
>>>>> Subject: Re: Using CUDA within Spark / boosting linear algebra
>>>>>
>>>>> Hi Alexander,
>>>>>
>>>>> Using GPUs with Spark would be very exciting.  Small comment: Concerning
>>>>> your question earlier about keeping data stored on the GPU rather than
>>>>> having to move it between main memory and GPU memory on each iteration, I
>>>>> would guess this would be critical to getting good performance.  If you
>>>>> could do multiple local iterations before aggregating results, then the
>>>>> cost of data movement to the GPU could be amortized (and I believe that is
>>>>> done in practice).  Having Spark be aware of the GPU and using it as
>>>>> another part of memory sounds like a much bigger undertaking.
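>>>>>
>>>>> Something like the following BIDMat-flavoured sketch is what I have in mind
>>>>> (purely illustrative -- I am borrowing the GMat/FMat conversion calls that
>>>>> appear elsewhere in this thread, and the loop and iteration count are
>>>>> placeholders):
>>>>>
>>>>> val n = 10000
>>>>> val f = rand(n, n)
>>>>> val g = GMat(n, n); g.copyFrom(f)   // pay the host->device copy once
>>>>> var acc = g
>>>>> for (i <- 0 until 10) {             // several local iterations...
>>>>>   acc = acc * g                     // ...that stay entirely in GPU memory
>>>>> }
>>>>> val result = acc.toFMat(null)       // copy back to host memory once, at the end
>>>>>
>>>>> One transfer in and one transfer out get amortized over all of the
>>>>> iterations in between.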
>>>>>
>>>>> Joseph
>>>>>
>>>>> On Thu, Feb 5, 2015 at 4:59 PM, Ulanov, Alexander <alexander.ula...@hp.com> wrote:
>>>>> Thank you for the explanation! I've watched the BIDMach presentation by John
>>>>> Canny and I am really inspired by his talk and the comparisons with Spark MLlib.
>>>>>
>>>>> I am very interested to find out what will be better within Spark: BIDMat
>>>>> or netlib-java with CPU or GPU natives. Could you suggest a fair way to
>>>>> benchmark them? Currently I do benchmarks on artificial neural networks in
>>>>> batch mode. While it is not a “pure” test of linear algebra, it involves
>>>>> some other things that are essential to machine learning.
>>>>>
>>>>> From: Evan R. Sparks [mailto:evan.spa...@gmail.com]
>>>>> Sent: Thursday, February 05, 2015 1:29 PM
>>>>> To: Ulanov, Alexander
>>>>> Cc: dev@spark.apache.org
>>>>> Subject: Re: Using CUDA within Spark / boosting linear algebra
>>>>>
>>>>> I'd be surprised if BIDMat+OpenBLAS was significantly faster than
>>>>> netlib-java+OpenBLAS, but if it is much faster it's probably due to data
>>>>> layout and fewer levels of indirection - it's definitely a worthwhile
>>>>> experiment to run. The main speedups I've seen from using it come from
>>>>> highly optimized GPU code for linear algebra. I know that in the past Canny
>>>>> has gone as far as to write custom GPU kernels for performance-critical
>>>>> regions of code.[1]
>>>>>
>>>>> BIDMach is highly optimized for single node performance or performance on
>>>>> small clusters.[2] Once data doesn't fit easily in GPU memory (or can't be
>>>>> batched in that way), performance tends to fall off. Canny argues for
>>>>> hardware/software codesign and as such prefers machine configurations that
>>>>> are quite different from what we find in most commodity cluster nodes -
>>>>> e.g. 10 disk channels and 4 GPUs.
>>>>>
>>>>> In contrast, MLlib was designed for horizontal scalability on commodity
>>>>> clusters and works best on very big datasets - on the order of terabytes.
>>>>>
>>>>> For the most part, these projects developed concurrently to address
>>>>> slightly different use cases. That said, there may be bits of BIDMach we
>>>>> could repurpose for MLlib - keep in mind we need to be careful about
>>>>> maintaining cross-language compatibility for our Java and Python users,
>>>>> though.
>>>>>
>>>>> - Evan
>>>>>
>>>>> [1] - http://arxiv.org/abs/1409.5402
>>>>> [2] - http://eecs.berkeley.edu/~hzhao/papers/BD.pdf
>>>>>
>>>>> On Thu, Feb 5, 2015 at 1:00 PM, Ulanov, Alexander <alexander.ula...@hp.com> wrote:
>>>>> Hi Evan,
>>>>>
>>>>> Thank you for the suggestion! BIDMat seems to have terrific speed. Do you
>>>>> know what makes it faster than netlib-java?
>>>>>
>>>>> The same group has the BIDMach library, which implements machine learning.
>>>>> For some examples they use the Caffe convolutional neural network library,
>>>>> developed by another group at Berkeley. Could you elaborate on how all of
>>>>> these might be connected with Spark MLlib? If you take BIDMat for linear
>>>>> algebra, why don't you take BIDMach for optimization and learning?
>>>>>
>>>>> Best regards, Alexander
>>>>>
>>>>> From: Evan R. Sparks [mailto:evan.spa...@gmail.com]
>>>>> Sent: Thursday, February 05, 2015 12:09 PM
>>>>> To: Ulanov, Alexander
>>>>> Cc: dev@spark.apache.org
>>>>> Subject: Re: Using CUDA within Spark / boosting linear algebra
>>>>>
>>>>> I'd expect that we can make GPU-accelerated BLAS faster than CPU blas in
>>>>> many cases.
>>>>>
>>>>> You might consider taking a look at the codepaths that BIDMat (
>>>>> https://github.com/BIDData/BIDMat) takes and comparing them to
>>>>> netlib-java/breeze. John Canny et al. have done a bunch of work optimizing
>>>>> to make this work really fast from Scala. I've run it on my laptop and
>>>>> compared to MKL and in certain cases it's 10x faster at matrix multiply.
>>>>> There are a lot of layers of indirection here and you really want to avoid
>>>>> data copying as much as possible.
>>>>>
>>>>> We could also consider swapping in BIDMat for Breeze, but that would be
>>>>> a big project, and if we can figure out how to get breeze+cublas to
>>>>> comparable performance that would be a big win.
>>>>>
>>>>> On Thu, Feb 5, 2015 at 11:55 AM, Ulanov, Alexander <alexander.ula...@hp.com> wrote:
>>>>> Dear Spark developers,
>>>>>
>>>>> I am exploring how to make linear algebra operations faster within Spark.
>>>>> One way of doing this is to use the Scala Breeze library that is bundled with
>>>>> Spark. For matrix operations, it employs Netlib-java, which has a Java
>>>>> wrapper for BLAS (basic linear algebra subprograms) and LAPACK native
>>>>> binaries if they are available on the worker node. It also has its own
>>>>> optimized Java implementation of BLAS. It is worth mentioning that native
>>>>> binaries provide better performance only for BLAS level 3, i.e.
>>>>> matrix-matrix operations such as general matrix multiplication (GEMM). This
>>>>> is confirmed by the GEMM test on the Netlib-java page
>>>>> https://github.com/fommil/netlib-java. I also confirmed it in my
>>>>> experiments with training an artificial neural network
>>>>> https://github.com/apache/spark/pull/1290#issuecomment-70313952.
>>>>> However, I would like to boost performance more.
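>>>>>
>>>>> For concreteness, the code path I am talking about is just an ordinary Breeze
>>>>> multiply, which dispatches to netlib-java's GEMM underneath (a minimal
>>>>> sketch; the sizes are arbitrary):
>>>>>
>>>>> import breeze.linalg.DenseMatrix
>>>>>
>>>>> val a = DenseMatrix.rand(1000, 1000)
>>>>> val b = DenseMatrix.rand(1000, 1000)
>>>>> val c = a * b   // BLAS level 3 (dgemm) via Netlib-java when a native BLAS is available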
>>>>>
>>>>> GPUs are supposed to be fast at linear algebra, and there is an Nvidia CUDA
>>>>> implementation of BLAS, called cuBLAS. I have one Linux server with an Nvidia
>>>>> GPU and I was able to do the following. I linked cublas (instead of
>>>>> cpu-based blas) with Netlib-java wrapper and put it into Spark, so
>>>>> Breeze/Netlib is using it. Then I did some performance measurements with
>>>>> regards to artificial neural network batch learning in Spark MLlib that
>>>>> involves matrix-matrix multiplications. It turns out that for matrices of
>>>>> size less than ~1000x780, GPU cuBLAS has the same speed as CPU BLAS, and
>>>>> cuBLAS becomes slower for bigger matrices. It is worth mentioning that it
>>>>> was not a test of ONLY multiplication, since there are other operations
>>>>> involved. One of the reasons for the slowdown might be the overhead of
>>>>> copying the matrices from main memory to graphics card memory and back.
>>>>>
>>>>> So, a few questions:
>>>>> 1) Do these results with CUDA make sense?
>>>>> 2) If the problem is the copy overhead, are there any libraries that allow
>>>>> forcing intermediate results to stay in graphics card memory, thus removing
>>>>> the overhead?
>>>>> 3) Any other options to speed up linear algebra in Spark?
>>>>>
>>>>> Thank you, Alexander
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>
>
> --
> Best regards,
> Sam
>

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org
