Don't use "big O" estimates, always measure. It used to work back in the
days when double multiplication was a bottleneck. The computation cost is
effectively free on both the CPU and GPU and you're seeing pure copying
costs. Also, I'm dubious that cublas is doing what you think it is. Can you
link me to the source code for DGEMM?

I show all of this in my talk, with explanations. I can't stress enough how
much I recommend that you watch it if you want to understand high-performance
hardware acceleration for linear algebra :-)
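
If you want to measure rather than estimate, a minimal harness along these
lines is all it takes (a sketch, assuming Breeze and netlib-java are on the
classpath):

import breeze.linalg.DenseMatrix

// Times a block of code and prints the wall-clock duration.
def time[A](label: String)(block: => A): A = {
  val t0 = System.nanoTime()
  val result = block
  println(f"$label: ${(System.nanoTime() - t0) / 1e9}%.3f s")
  result
}

val n = 1000
val a = DenseMatrix.rand(n, n)  // dense doubles
val b = DenseMatrix.rand(n, n)
time("dgemm")(a * b)  // dispatches to whichever BLAS netlib-java loaded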
On 27 Feb 2015 01:42, "Xiangrui Meng" <men...@gmail.com> wrote:

> The copying overhead should be quadratic in n, while the computation
> cost is cubic in n. I can understand that netlib-cublas is slower than
> netlib-openblas on small problems. But I'm surprised to see that it is
> still 20x slower at 10000x10000. I did the following on a g2.2xlarge
> instance with BIDMat:
>
> val n = 10000
>
> // random n x n matrix in CPU (host) memory
> val f = rand(n, n)
>
> // CPU only: flip resets BIDMat's timer/flop counter, flop reads it back
> flip; f*f; val rf = flop
>
> // CPU -> GPU copy, multiply on the GPU, copy the result back to an FMat
> flip; val g = GMat(n, n); g.copyFrom(f); (g*g).toFMat(null); val rg = flop
>
> // GPU only: operands already resident in GPU memory
> flip; g*g; val rgg = flop
>
> The CPU version finished in 12 seconds.
> The CPU->GPU->CPU version finished in 2.2 seconds.
> The GPU version finished in 1.7 seconds.
>
> I'm not sure whether my CPU->GPU->CPU code simulates the netlib-cublas
> path. But based on the result, the data copying overhead is definitely
> not as big as 20x at n = 10000.
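>
> To isolate the transfer cost by itself, the same BIDMat calls should let
> you time just the two copies with no multiply (a sketch reusing copyFrom
> and toFMat from above):
>
> flip; g.copyFrom(f); g.toFMat(null); val rcopy = flop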
>
> Best,
> Xiangrui
>
>
> On Thu, Feb 26, 2015 at 2:21 PM, Sam Halliday <sam.halli...@gmail.com> wrote:
> > I've had some email exchanges with the author of BIDMat: it does exactly
> > what you need to get the GPU benefit, writing higher-level algorithms
> > entirely in GPU kernels so that the memory stays on the device as long as
> > possible. The restriction of this approach is that it only offers
> > high-level algorithms, so it is not a toolkit for applied mathematics
> > research and development --- but it works well as a toolkit for
> > higher-level analysis (e.g. for analysts and practitioners).
> >
> > I believe BIDMat's approach is the best way to get performance out of
> > GPU hardware at the moment, but I also have strong evidence to suggest
> > that the hardware will catch up and the memory transfer costs between
> > CPU/GPU will disappear, meaning there will be no need for custom GPU
> > kernel implementations. I.e. please continue to use BLAS primitives when
> > writing new algorithms, and only go to the GPU for an alternative
> > optimised implementation.
> >
> > Note that CUDA and cuBLAS are *not* BLAS. They are BLAS-like, offering
> > an API that looks like BLAS but takes pointers to special regions of
> > GPU memory. Somebody has written a wrapper around CUDA to create a
> > proper BLAS library, but it gives only marginal performance over the
> > CPU because of the memory transfer overhead.
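> >
> > To make that concrete, here is roughly what one DGEMM looks like through
> > the JCuda/JCublas bindings (a sketch; I'm assuming JCublas, which mirrors
> > the legacy cuBLAS API - this is not the netlib-java wrapper itself).
> > Every Pointer refers to GPU memory, and the host arrays must be staged
> > across the bus explicitly on both sides of the call:
> >
> > import jcuda.{Pointer, Sizeof}
> > import jcuda.jcublas.JCublas
> >
> > val n = 1000
> > val hA, hB = Array.fill(n * n)(math.random) // host matrices, column-major
> > val hC = new Array[Double](n * n)
> > val dA, dB, dC = new Pointer                // handles to GPU memory
> >
> > JCublas.cublasInit()
> > JCublas.cublasAlloc(n * n, Sizeof.DOUBLE, dA)
> > JCublas.cublasAlloc(n * n, Sizeof.DOUBLE, dB)
> > JCublas.cublasAlloc(n * n, Sizeof.DOUBLE, dC)
> > JCublas.cublasSetVector(n * n, Sizeof.DOUBLE, Pointer.to(hA), 1, dA, 1) // CPU -> GPU
> > JCublas.cublasSetVector(n * n, Sizeof.DOUBLE, Pointer.to(hB), 1, dB, 1) // CPU -> GPU
> > JCublas.cublasDgemm('n', 'n', n, n, n, 1.0, dA, n, dB, n, 0.0, dC, n)   // on-device DGEMM
> > JCublas.cublasGetVector(n * n, Sizeof.DOUBLE, dC, 1, Pointer.to(hC), 1) // GPU -> CPU
> > JCublas.cublasFree(dA); JCublas.cublasFree(dB); JCublas.cublasFree(dC)
> > JCublas.cublasShutdown()
> >
> > A BLAS-shaped wrapper has to do that alloc/copy/free dance inside every
> > call, which is where the time goes for all but the largest matrices.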
> >
> > This slide from my talk
> >
> >   http://fommil.github.io/scalax14/#/11/2
> >
> > says it all. The X axis is matrix size and the Y axis is (log-scale)
> > time to do a DGEMM. The black line is the "cheating" time for the GPU,
> > and the green line is the time including copying the memory to/from GPU
> > memory. APUs have the potential to eliminate the green line.
> >
> > Best regards,
> > Sam
> >
> >
> >
> > "Ulanov, Alexander" <alexander.ula...@hp.com> writes:
> >
> >> Evan, thank you for the summary. I would like to add some more
> >> observations. The GPU that I used is 2.5x cheaper than the CPU (CPU at
> >> $250 vs. GPU at $100); both are about 3 years old. I also did a small
> >> test with modern hardware, and the new Nvidia Titan GPU was slightly
> >> more than one order of magnitude faster than an Intel E5-2650 v2 on the
> >> same tests. However, it costs as much as the CPU ($1200). My takeaway is
> >> that GPUs are making better price/performance progress.
> >>
> >>
> >>
> >> Xiangrui, I was also surprised that BIDMat-cuda was faster than
> >> netlib-cuda, and the most reasonable explanation is that it holds the
> >> result in GPU memory, as Sam suggested. At the same time, that is fine,
> >> because you can copy the result back from the GPU only when needed.
> >> However, to be sure, I am going to ask the developer of BIDMat at his
> >> upcoming talk.
> >>
> >>
> >>
> >> Best regards, Alexander
> >>
> >>
> >> From: Sam Halliday [mailto:sam.halli...@gmail.com]
> >> Sent: Thursday, February 26, 2015 1:56 PM
> >> To: Xiangrui Meng
> >> Cc: dev@spark.apache.org; Joseph Bradley; Ulanov, Alexander; Evan R. Sparks
> >> Subject: Re: Using CUDA within Spark / boosting linear algebra
> >>
> >>
> >> Btw, I wish people would stop cheating when comparing CPU and GPU
> >> timings for things like matrix multiply :-P
> >>
> >> Please always compare apples with apples: include the time it takes to
> >> set up the matrices, send them to the processing unit, do the
> >> calculation, AND copy the result back to where you need to see it.
> >>
> >> Ignoring this will make you believe that your GPU is thousands of times
> >> faster than it really is. Again, jump to the end of my talk for graphs
> >> and more discussion.... especially the bit about me being keen on
> >> funding to investigate APU hardware further ;-) (I believe it will
> >> solve the problem)
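> >>
> >> In pseudo-Scala, the honest timer brackets the whole round trip; here
> >> upload, gemmOnGpu and download are hypothetical stand-ins for whatever
> >> your binding actually provides:
> >>
> >> def timed(block: => Unit): Double = {
> >>   val t0 = System.nanoTime(); block; (System.nanoTime() - t0) / 1e9
> >> }
> >> val honest   = timed { upload(a); upload(b); gemmOnGpu(); download(c) }
> >> val cheating = timed { gemmOnGpu() } // kernel only - not a fair number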
> >> On 26 Feb 2015 21:16, "Xiangrui Meng" <men...@gmail.com> wrote:
> >> Hey Alexander,
> >>
> >> I don't quite understand the part where netlib-cublas is about 20x
> >> slower than netlib-openblas. What is the overhead of using a GPU BLAS
> >> with netlib-java?
> >>
> >> CC'ed Sam, the author of netlib-java.
> >>
> >> Best,
> >> Xiangrui
> >>
> >> On Wed, Feb 25, 2015 at 3:36 PM, Joseph Bradley <jos...@databricks.com> wrote:
> >>> Better documentation for linking would be very helpful!  Here's a JIRA:
> >>> https://issues.apache.org/jira/browse/SPARK-6019
> >>>
> >>>
> >>> On Wed, Feb 25, 2015 at 2:53 PM, Evan R. Sparks <evan.spa...@gmail.com> wrote:
> >>>
> >>>> Thanks for compiling all the data and running these benchmarks, Alex.
> >>>> The big takeaways here can be seen with this chart:
> >>>>
> >>>> https://docs.google.com/spreadsheets/d/1aRm2IADRfXQV7G2vrcVh4StF50uZHl6kmAJeaZZggr0/pubchart?oid=1899767119&format=interactive
> >>>>
> >>>> 1) A properly configured GPU matrix multiply implementation (e.g.
> >>>> BIDMat+GPU) can provide a substantial (but less than an order of
> >>>> magnitude) benefit over a well-tuned CPU implementation (e.g.
> >>>> BIDMat+MKL or netlib-java+openblas-compiled).
> >>>> 2) A poorly tuned CPU implementation (netlib-f2jblas or netlib-ref)
> >>>> can be 1-2 orders of magnitude worse than a well-tuned CPU
> >>>> implementation, particularly for larger matrices. This is not to pick
> >>>> on netlib - it basically agrees with the author's own benchmarks
> >>>> (https://github.com/fommil/netlib-java).
> >>>>
> >>>> I think that most of our users are in a situation where using GPUs
> >>>> may not be practical - although we could consider having a good GPU
> >>>> backend available as an option. However, *ALL* users of MLlib could
> >>>> benefit (potentially tremendously) from using a well-tuned CPU-based
> >>>> BLAS implementation. Perhaps we should consider updating the MLlib
> >>>> guide with a more complete section on enabling high-performance
> >>>> binaries on OS X and Linux? Or better, figure out a way for the
> >>>> system to fetch these automatically.
> >>>>
> >>>> - Evan
> >>>>
> >>>>
> >>>>
> >>>> On Thu, Feb 12, 2015 at 4:18 PM, Ulanov, Alexander <alexander.ula...@hp.com> wrote:
> >>>>
> >>>>> Just to summarize this thread, I was finally able to make all the
> >>>>> performance comparisons that we discussed. It turns out that:
> >>>>>
> >>>>> BIDMat-cublas >> BIDMat MKL == netlib-mkl == netlib-openblas-compiled >
> >>>>> netlib-openblas-yum-repo == netlib-cublas > netlib-blas > f2jblas
> >>>>>
> >>>>> Below is the link to the spreadsheet with full results.
> >>>>>
> >>>>>
> >>>>> https://docs.google.com/spreadsheets/d/1lWdVSuSragOobb0A_oeouQgHUMx378T9J5r7kwKSPkY/edit?usp=sharing
> >>>>>
> >>>>> One thing still needs exploration: does BIDMat-cublas copy to/from
> >>>>> the machine's RAM?
> >>>>>
> >>>>> -----Original Message-----
> >>>>> From: Ulanov, Alexander
> >>>>> Sent: Tuesday, February 10, 2015 2:12 PM
> >>>>> To: Evan R. Sparks
> >>>>> Cc: Joseph Bradley; dev@spark.apache.org
> >>>>> Subject: RE: Using CUDA within Spark / boosting linear algebra
> >>>>>
> >>>>> Thanks, Evan! It seems that ticket was marked as a duplicate, though
> >>>>> the original one discusses a slightly different topic. I was able to
> >>>>> link netlib with the MKL from the BIDMat binaries. Indeed, MKL is
> >>>>> statically linked inside a 60MB library.
> >>>>>
> >>>>> (times in seconds)
> >>>>>
> >>>>> | A*B size                | BIDMat MKL  | Breeze+Netlib-MKL (from BIDMat) | Breeze+Netlib-OpenBLAS (native system) | Breeze+Netlib-f2jblas |
> >>>>> +-------------------------+-------------+---------------------------------+----------------------------------------+-----------------------+
> >>>>> | 100x100*100x100         | 0.00205596  | 0.000381                        | 0.03810324                             | 0.002556              |
> >>>>> | 1000x1000*1000x1000     | 0.018320947 | 0.038316857                     | 0.51803557                             | 1.638475459           |
> >>>>> | 10000x10000*10000x10000 | 23.78046632 | 32.94546697                     | 445.0935211                            | 1569.233228           |
> >>>>>
> >>>>> It turns out that precompiled MKL is faster than precompiled
> >>>>> OpenBLAS on my machine. I'll probably add two more columns with
> >>>>> locally compiled OpenBLAS and CUDA.
> >>>>>
> >>>>> Alexander
> >>>>>
> >>>>> From: Evan R. Sparks [mailto:evan.spa...@gmail.com]
> >>>>> Sent: Monday, February 09, 2015 6:06 PM
> >>>>> To: Ulanov, Alexander
> >>>>> Cc: Joseph Bradley; dev@spark.apache.org
> >>>>> Subject: Re: Using CUDA within Spark / boosting linear algebra
> >>>>>
> >>>>> Great - perhaps we can move this discussion off-list and onto a JIRA
> >>>>> ticket? (Here's one: https://issues.apache.org/jira/browse/SPARK-5705)
> >>>>>
> >>>>> It seems like this is going to be somewhat exploratory for a while
> >>>>> (and there's probably only a handful of us who really care about
> >>>>> fast linear algebra!)
> >>>>>
> >>>>> - Evan
> >>>>>
> >>>>> On Mon, Feb 9, 2015 at 4:48 PM, Ulanov, Alexander <alexander.ula...@hp.com> wrote:
> >>>>> Hi Evan,
> >>>>>
> >>>>> Thank you for the explanation and the useful link. I am going to
> >>>>> build OpenBLAS, link it with netlib-java, and run the benchmark
> >>>>> again.
> >>>>>
> >>>>> Do I understand correctly that the BIDMat binaries contain a
> >>>>> statically linked Intel MKL BLAS? That might be why I am able to run
> >>>>> BIDMat without having MKL installed on my server. If so, I wonder
> >>>>> whether that is OK, given that Intel sells this library. In any
> >>>>> case, it seems that precompiled MKL performs better than precompiled
> >>>>> OpenBLAS on my machine, given that BIDMat and netlib-java are
> >>>>> supposed to have comparable JNI overheads.
> >>>>>
> >>>>> Still, it might be interesting to link netlib-java with Intel MKL,
> >>>>> as you suggested. I wonder whether John Canny (BIDMat) and Sam
> >>>>> Halliday (netlib-java) would be interested in comparing their
> >>>>> libraries.
> >>>>>
> >>>>> Best regards, Alexander
> >>>>>
> >>>>> From: Evan R. Sparks [mailto:evan.spa...@gmail.com]
> >>>>> Sent: Friday, February 06, 2015 5:58 PM
> >>>>>
> >>>>> To: Ulanov, Alexander
> >>>>> Cc: Joseph Bradley; dev@spark.apache.org
> >>>>> Subject: Re: Using CUDA within Spark / boosting linear algebra
> >>>>>
> >>>>> I would build OpenBLAS yourself, since good BLAS performance comes
> >>>>> from getting cache sizes, etc. set up correctly for your particular
> >>>>> hardware - this is often a very tricky process (see, e.g., ATLAS),
> >>>>> but we found that on relatively modern Xeon chips, OpenBLAS builds
> >>>>> quickly and yields performance competitive with MKL.
> >>>>>
> >>>>> To make sure the right library is getting used, you have to make
> >>>>> sure it's first on the search path - export
> >>>>> LD_LIBRARY_PATH=/path/to/dir/containing/your/blas will do the trick
> >>>>> here (note that LD_LIBRARY_PATH takes a directory, not the .so file
> >>>>> itself).
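> >>>>>
> >>>>> For reference, which netlib-java implementation gets used can also
> >>>>> be chosen with a JVM flag (class names per its README) - though the
> >>>>> flag only selects the wrapper; the OS linker still decides which
> >>>>> native libblas actually gets loaded:
> >>>>>
> >>>>> -Dcom.github.fommil.netlib.BLAS=com.github.fommil.netlib.F2jBLAS (pure Java)
> >>>>> -Dcom.github.fommil.netlib.BLAS=com.github.fommil.netlib.NativeRefBLAS (bundled reference natives)
> >>>>> -Dcom.github.fommil.netlib.BLAS=com.github.fommil.netlib.NativeSystemBLAS (system libblas)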
> >>>>>
> >>>>> For some examples of getting netlib-java set up on an EC2 node, and
> >>>>> some example benchmarking code we ran a while back, see:
> >>>>> https://github.com/shivaram/matrix-bench
> >>>>>
> >>>>> In particular, build-openblas-ec2.sh shows you how to build the
> >>>>> library and set up symlinks correctly, and scala/run-netlib.sh shows
> >>>>> you how to get the path set up and have that library picked up by
> >>>>> netlib-java.
> >>>>>
> >>>>> In this way - you could probably get cuBLAS set up to be used by
> >>>>> netlib-java as well.
> >>>>>
> >>>>> - Evan
> >>>>>
> >>>>> On Fri, Feb 6, 2015 at 5:43 PM, Ulanov, Alexander <alexander.ula...@hp.com> wrote:
> >>>>> Evan, could you elaborate on how to force BIDMat and netlib-java to
> >>>>> load the right BLAS? For netlib there are a few JVM flags, such as
> >>>>> -Dcom.github.fommil.netlib.BLAS=com.github.fommil.netlib.F2jBLAS, so
> >>>>> I can force it to use the Java implementation. I am not sure I
> >>>>> understand how to force it to use a specific BLAS (rather than a
> >>>>> specific wrapper for BLAS).
> >>>>>
> >>>>> Btw, I have installed OpenBLAS (yum install openblas), so I suppose
> >>>>> netlib is using it.
> >>>>>
> >>>>> From: Evan R. Sparks [mailto:evan.spa...@gmail.com]
> >>>>> Sent: Friday, February 06, 2015 5:19 PM
> >>>>> To: Ulanov, Alexander
> >>>>> Cc: Joseph Bradley; dev@spark.apache.org
> >>>>>
> >>>>> Subject: Re: Using CUDA within Spark / boosting linear algebra
> >>>>>
> >>>>> Getting Breeze to pick up the right BLAS library is critical for
> >>>>> performance. I recommend using OpenBLAS (or MKL, if you already have
> >>>>> it). It might make sense to force BIDMat to use the same underlying
> >>>>> BLAS library as well.
> >>>>>
> >>>>> On Fri, Feb 6, 2015 at 4:42 PM, Ulanov, Alexander <alexander.ula...@hp.com> wrote:
> >>>>> Hi Evan, Joseph
> >>>>>
> >>>>> I did a few matrix multiplication tests and BIDMat seems to be ~10x
> >>>>> faster than netlib-java+Breeze (sorry for the weird table
> >>>>> formatting):
> >>>>>
> >>>>> (times in seconds)
> >>>>>
> >>>>> | A*B size                | BIDMat MKL  | Breeze+Netlib-java native_system_linux_x86-64 | Breeze+Netlib-java f2jblas |
> >>>>> +-------------------------+-------------+-----------------------------------------------+----------------------------+
> >>>>> | 100x100*100x100         | 0.00205596  | 0.03810324                                    | 0.002556                   |
> >>>>> | 1000x1000*1000x1000     | 0.018320947 | 0.51803557                                    | 1.638475459                |
> >>>>> | 10000x10000*10000x10000 | 23.78046632 | 445.0935211                                   | 1569.233228                |
> >>>>>
> >>>>> Configuration: Intel(R) Xeon(R) CPU E31240 3.3 GHz, 6GB RAM,
> >>>>> Fedora 19 Linux, Scala 2.11.
> >>>>>
> >>>>> Later I will run tests with CUDA; I need to install a newer CUDA
> >>>>> version for this purpose.
> >>>>>
> >>>>> Do you have any ideas why Breeze+netlib with native BLAS is so much
> >>>>> slower than BIDMat MKL?
> >>>>>
> >>>>> Best regards, Alexander
> >>>>>
> >>>>> From: Joseph Bradley [mailto:jos...@databricks.com]
> >>>>> Sent: Thursday, February 05, 2015 5:29 PM
> >>>>> To: Ulanov, Alexander
> >>>>> Cc: Evan R. Sparks; dev@spark.apache.org
> >>>>> Subject: Re: Using CUDA within Spark / boosting linear algebra
> >>>>>
> >>>>> Hi Alexander,
> >>>>>
> >>>>> Using GPUs with Spark would be very exciting. Small comment:
> >>>>> concerning your question earlier about keeping data stored on the
> >>>>> GPU rather than having to move it between main memory and GPU memory
> >>>>> on each iteration, I would guess this would be critical to getting
> >>>>> good performance. If you could do multiple local iterations before
> >>>>> aggregating results, then the cost of data movement to the GPU could
> >>>>> be amortized (and I believe that is done in practice). Having Spark
> >>>>> be aware of the GPU and using it as another part of memory sounds
> >>>>> like a much bigger undertaking.
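> >>>>>
> >>>>> (Back-of-envelope version of the amortization argument: if one
> >>>>> host<->GPU transfer costs t and one iteration costs c on the GPU,
> >>>>> then k local iterations cost t + k*c rather than k*(t + c), so the
> >>>>> per-iteration transfer overhead shrinks from t to t/k.)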
> >>>>>
> >>>>> Joseph
> >>>>>
> >>>>> On Thu, Feb 5, 2015 at 4:59 PM, Ulanov, Alexander <alexander.ula...@hp.com> wrote:
> >>>>> Thank you for the explanation! I’ve watched the BIDMach presentation
> >>>>> by John Canny and I am really inspired by his talk and comparisons
> >>>>> with Spark MLlib.
> >>>>>
> >>>>> I am very interested to find out which will be better within Spark:
> >>>>> BIDMat, or netlib-java with CPU or GPU natives. Could you suggest a
> >>>>> fair way to benchmark them? Currently I run benchmarks on artificial
> >>>>> neural networks in batch mode. While that is not a “pure” test of
> >>>>> linear algebra, it involves other things that are essential to
> >>>>> machine learning.
> >>>>>
> >>>>> From: Evan R. Sparks [mailto:evan.spa...@gmail.com]
> >>>>> Sent: Thursday, February 05, 2015 1:29 PM
> >>>>> To: Ulanov, Alexander
> >>>>> Cc: dev@spark.apache.org
> >>>>> Subject: Re: Using CUDA within Spark / boosting linear algebra
> >>>>>
> >>>>> I'd be surprised if BIDMat+OpenBLAS were significantly faster than
> >>>>> netlib-java+OpenBLAS, but if it is much faster it's probably due to
> >>>>> data layout and fewer levels of indirection - it's definitely a
> >>>>> worthwhile experiment to run. The main speedups I've seen from using
> >>>>> it come from highly optimized GPU code for linear algebra. I know
> >>>>> that in the past Canny has gone as far as writing custom GPU kernels
> >>>>> for performance-critical regions of code. [1]
> >>>>>
> >>>>> BIDMach is highly optimized for single-node performance, or
> >>>>> performance on small clusters. [2] Once data doesn't fit easily in
> >>>>> GPU memory (or can't be batched that way), the performance tends to
> >>>>> fall off. Canny argues for hardware/software codesign, and as such
> >>>>> prefers machine configurations that are quite different from what we
> >>>>> find in most commodity cluster nodes - e.g. 10 disk channels and 4
> >>>>> GPUs.
> >>>>>
> >>>>> In contrast, MLlib was designed for horizontal scalability on
> >>>>> commodity clusters and works best on very big datasets - on the
> >>>>> order of terabytes.
> >>>>>
> >>>>> For the most part, these projects developed concurrently to address
> >>>>> slightly different use cases. That said, there may be bits of
> >>>>> BIDMach we could repurpose for MLlib - keep in mind we need to be
> >>>>> careful about maintaining cross-language compatibility for our Java
> >>>>> and Python users, though.
> >>>>>
> >>>>> - Evan
> >>>>>
> >>>>> [1] - http://arxiv.org/abs/1409.5402
> >>>>> [2] - http://eecs.berkeley.edu/~hzhao/papers/BD.pdf
> >>>>>
> >>>>> On Thu, Feb 5, 2015 at 1:00 PM, Ulanov, Alexander <alexander.ula...@hp.com> wrote:
> >>>>> Hi Evan,
> >>>>>
> >>>>> Thank you for the suggestion! BIDMat seems to have terrific speed.
> >>>>> Do you know what makes it faster than netlib-java?
> >>>>>
> >>>>> The same group has the BIDMach library, which implements machine
> >>>>> learning. For some examples they use the Caffe convolutional neural
> >>>>> network library, maintained by another group at Berkeley. Could you
> >>>>> elaborate on how all of these might be connected with Spark MLlib?
> >>>>> If you take BIDMat for linear algebra, why not take BIDMach for
> >>>>> optimization and learning?
> >>>>>
> >>>>> Best regards, Alexander
> >>>>>
> >>>>> From: Evan R. Sparks [mailto:evan.spa...@gmail.com]
> >>>>> Sent: Thursday, February 05, 2015 12:09 PM
> >>>>> To: Ulanov, Alexander
> >>>>> Cc: dev@spark.apache.org
> >>>>> Subject: Re: Using CUDA within Spark / boosting linear algebra
> >>>>>
> >>>>> I'd expect that we can make GPU-accelerated BLAS faster than CPU
> >>>>> BLAS in many cases.
> >>>>>
> >>>>> You might consider taking a look at the codepaths that BIDMat
> >>>>> (https://github.com/BIDData/BIDMat) takes and comparing them to
> >>>>> netlib-java/Breeze. John Canny et al. have done a bunch of work
> >>>>> optimizing to make this run really fast from Scala. I've run it on
> >>>>> my laptop, compared it to MKL, and in certain cases it's 10x faster
> >>>>> at matrix multiply. There are a lot of layers of indirection here,
> >>>>> and you really want to avoid data copying as much as possible.
> >>>>>
> >>>>> We could also consider swapping Breeze out for BIDMat, but that
> >>>>> would be a big project, and if we can figure out how to get
> >>>>> Breeze+cuBLAS to comparable performance, that would be a big win.
> >>>>>
> >>>>> On Thu, Feb 5, 2015 at 11:55 AM, Ulanov, Alexander <alexander.ula...@hp.com> wrote:
> >>>>> Dear Spark developers,
> >>>>>
> >>>>> I am exploring how to make linear algebra operations faster within
> >>>>> Spark. One way of doing this is to use the Scala Breeze library that
> >>>>> is bundled with Spark. For matrix operations, it employs netlib-java,
> >>>>> which has a Java wrapper for BLAS (basic linear algebra subprograms)
> >>>>> and LAPACK native binaries if they are available on the worker node.
> >>>>> It also has its own optimized Java implementation of BLAS. It is
> >>>>> worth mentioning that native binaries provide better performance
> >>>>> only for BLAS level 3, i.e. matrix-matrix operations, or general
> >>>>> matrix multiplication (GEMM, which computes C := alpha*A*B + beta*C).
> >>>>> This is confirmed by the GEMM benchmark on the netlib-java page
> >>>>> https://github.com/fommil/netlib-java. I also confirmed it with my
> >>>>> experiments with training of an artificial neural network
> >>>>> https://github.com/apache/spark/pull/1290#issuecomment-70313952.
> >>>>> However, I would like to boost performance more.
> >>>>>
> >>>>> GPUs are supposed to be fast at linear algebra, and there is an
> >>>>> Nvidia CUDA implementation of BLAS, called cublas. I have one Linux
> >>>>> server with an Nvidia GPU and I was able to do the following. I
> >>>>> linked cublas (instead of a CPU-based BLAS) with the netlib-java
> >>>>> wrapper and put it into Spark, so Breeze/netlib is using it. Then I
> >>>>> did some performance measurements with regard to artificial neural
> >>>>> network batch learning in Spark MLlib, which involves matrix-matrix
> >>>>> multiplications. It turns out that for matrices of size less than
> >>>>> ~1000x780, GPU cublas has the same speed as CPU BLAS, and cublas
> >>>>> becomes slower for bigger matrices. It is worth mentioning that this
> >>>>> was not a test of ONLY multiplication, since other operations were
> >>>>> involved. One of the reasons for the slowdown might be the overhead
> >>>>> of copying the matrices from main memory to graphics card memory
> >>>>> and back.
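> >>>>>
> >>>>> (A rough sanity check on that copying hypothesis: an n x n double
> >>>>> matrix is 8*n^2 bytes, so at n = 10000 moving A, B and the result C
> >>>>> is about 3 * 800 MB = 2.4 GB. At a few GB/s of PCIe bandwidth that
> >>>>> is well under a second, while the 2*n^3 = 2*10^12 flops of the
> >>>>> multiply take tens of seconds on a CPU BLAS - so copying alone
> >>>>> should not explain a large slowdown at this size.)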
> >>>>>
> >>>>> So, a few questions:
> >>>>> 1) Do these results with CUDA make sense?
> >>>>> 2) If the problem is copy overhead, are there any libraries that
> >>>>> allow intermediate results to stay in graphics card memory, thus
> >>>>> removing the overhead?
> >>>>> 3) Any other options for speeding up linear algebra in Spark?
> >>>>>
> >>>>> Thank you, Alexander
> >>>>>
> >>>>> ---------------------------------------------------------------------
> >>>>> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> >>>>> For additional commands, e-mail: dev-h...@spark.apache.org
> >>>>>
> >>>>>
> >>>>>
> >>>>>
> >>>>
> >
> > --
> > Best regards,
> > Sam
> >
>
