The copying overhead should be quadratic in n, while the computation cost is cubic in n. I can understand that netlib-cublas is slower than netlib-openblas on small problems, but I'm surprised to see that it is still 20x slower on 10000x10000. I did the following on a g2.2xlarge instance with BIDMat:
  val n = 10000
  val f = rand(n, n)
  flip; f*f; val rf = flop
  flip; val g = GMat(n, n); g.copyFrom(f); (g*g).toFMat(null); val rg = flop
  flip; g*g; val rgg = flop

The CPU version finished in 12 seconds. The CPU->GPU->CPU version finished in 2.2 seconds. The GPU-only version finished in 1.7 seconds. I'm not sure whether my CPU->GPU->CPU code simulates the netlib-cublas path, but based on this result, the data copying overhead is definitely not as big as 20x at n = 10000.

Best,
Xiangrui

On Thu, Feb 26, 2015 at 2:21 PM, Sam Halliday <sam.halli...@gmail.com> wrote:
> I've had some email exchanges with the author of BIDMat: it does exactly
> what you need to get the GPU benefit, writing the higher-level algorithms
> entirely in GPU kernels so that the memory stays on the GPU as long as
> possible. The restriction of this approach is that it only offers
> high-level algorithms, so it is not a toolkit for applied-mathematics
> research and development --- but it works well as a toolkit for
> higher-level analysis (e.g. for analysts and practitioners).
>
> I believe BIDMat's approach is the best way to get performance out of GPU
> hardware at the moment, but I also have strong evidence to suggest that
> the hardware will catch up and the memory transfer costs between CPU and
> GPU will disappear, meaning there will be no need for custom GPU kernel
> implementations. I.e. please continue to use BLAS primitives when writing
> new algorithms, and only go to the GPU for an alternative optimised
> implementation.
>
> Note that CUDA and cuBLAS are *not* BLAS. They are BLAS-like, offering an
> API that looks like BLAS but takes pointers to special regions of GPU
> memory. Somebody has written a wrapper around CUDA to create a proper
> BLAS library, but it gives only marginal gains over the CPU because of
> the memory transfer overhead.
>
> This slide from my talk
>
>   http://fommil.github.io/scalax14/#/11/2
>
> says it all. The X axis is matrix size, the Y axis is (logarithmic) time
> to do DGEMM. The black line is the "cheating" time for the GPU and the
> green line is after copying the memory to/from GPU memory. APUs have the
> potential to eliminate the green line.
>
> Best regards,
> Sam
>
> "Ulanov, Alexander" <alexander.ula...@hp.com> writes:
>
>> Evan, thank you for the summary. I would like to add some more
>> observations. The GPU that I used is 2.5 times cheaper than the CPU
>> ($100 vs $250); both are 3 years old. I also did a small test with
>> modern hardware: the new Nvidia Titan GPU was slightly more than an
>> order of magnitude faster than an Intel E5-2650 v2 on the same tests,
>> but it costs about as much as the CPU ($1200). My takeaway is that GPUs
>> are making better price/performance progress.
>>
>> Xiangrui, I was also surprised that BIDMat-cuda was faster than
>> netlib-cuda, and the most reasonable explanation is that it holds the
>> result in GPU memory, as Sam suggested. At the same time, that is OK,
>> because you can copy the result back from the GPU only when needed.
>> To be sure, though, I am going to ask the developer of BIDMat at his
>> upcoming talk.
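>>
>> To put a rough number on the copy cost, here is a minimal back-of-envelope
>> sketch (plain Scala; the ~3 GB/s effective PCIe bandwidth and ~1 Tflop/s
>> GPU throughput are illustrative assumptions, not measurements):
>>
>>   object CopyVsCompute {
>>     def main(args: Array[String]): Unit = {
>>       val n = 10000L
>>       val copyBytes = 2L * n * n * 4L    // 4-byte floats: input up + result back
>>       val flops = 2L * n * n * n         // multiply-adds in an n x n GEMM
>>       val copySeconds = copyBytes / 3e9  // assumed effective PCIe bandwidth
>>       val computeSeconds = flops / 1e12  // assumed GPU throughput, 1 Tflop/s
>>       println(f"copy: $copySeconds%.2f s, compute: $computeSeconds%.2f s")
>>     }
>>   }
>>
>> With these assumptions the copies take about 0.27 s against roughly 2 s of
>> compute at n = 10000, and the gap only widens as n grows, since the copy
>> is O(n^2) while the compute is O(n^3).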
>>
>> Best regards, Alexander
>>
>> From: Sam Halliday [mailto:sam.halli...@gmail.com]
>> Sent: Thursday, February 26, 2015 1:56 PM
>> To: Xiangrui Meng
>> Cc: dev@spark.apache.org; Joseph Bradley; Ulanov, Alexander; Evan R. Sparks
>> Subject: Re: Using CUDA within Spark / boosting linear algebra
>>
>> Btw, I wish people would stop cheating when comparing CPU and GPU timings
>> for things like matrix multiply :-P
>>
>> Please always compare apples with apples: include the time it takes to
>> set up the matrices, send them to the processing unit, do the
>> calculation, AND copy the results back to where you need to see them.
>>
>> Ignoring this will make you believe that your GPU is thousands of times
>> faster than it really is. Again, jump to the end of my talk for graphs
>> and more discussion... especially the bit about me being keen on funding
>> to investigate APU hardware further ;-) (I believe it will solve the
>> problem.)
>>
>> On 26 Feb 2015 21:16, "Xiangrui Meng" <men...@gmail.com> wrote:
>> Hey Alexander,
>>
>> I don't quite understand the part where netlib-cublas is about 20x slower
>> than netlib-openblas. What is the overhead of using a GPU BLAS with
>> netlib-java?
>>
>> CC'ed Sam, the author of netlib-java.
>>
>> Best,
>> Xiangrui
>>
>> On Wed, Feb 25, 2015 at 3:36 PM, Joseph Bradley <jos...@databricks.com> wrote:
>>> Better documentation for linking would be very helpful! Here's a JIRA:
>>> https://issues.apache.org/jira/browse/SPARK-6019
>>>
>>> On Wed, Feb 25, 2015 at 2:53 PM, Evan R. Sparks <evan.spa...@gmail.com> wrote:
>>>
>>>> Thanks for compiling all the data and running these benchmarks, Alex.
>>>> The big takeaways here can be seen in this chart:
>>>>
>>>> https://docs.google.com/spreadsheets/d/1aRm2IADRfXQV7G2vrcVh4StF50uZHl6kmAJeaZZggr0/pubchart?oid=1899767119&format=interactive
>>>>
>>>> 1) A properly configured GPU matrix multiply implementation (e.g.
>>>> BIDMat+GPU) can provide a substantial (but less than an order of
>>>> magnitude) benefit over a well-tuned CPU implementation (e.g. BIDMat+MKL
>>>> or netlib-java+openblas-compiled).
>>>> 2) A poorly tuned CPU implementation (netlib-f2jblas or netlib-ref) can
>>>> be 1-2 orders of magnitude worse than a well-tuned CPU implementation,
>>>> particularly for larger matrices. This is not to pick on netlib - it
>>>> basically agrees with the author's own benchmarks
>>>> (https://github.com/fommil/netlib-java).
>>>>
>>>> I think that most of our users are in a situation where using GPUs may
>>>> not be practical - although we could consider having a good GPU backend
>>>> available as an option. However, *ALL* users of MLlib could benefit
>>>> (potentially tremendously) from using a well-tuned CPU-based BLAS
>>>> implementation. Perhaps we should consider updating the mllib guide with
>>>> a more complete section on enabling high-performance binaries on OSX and
>>>> Linux? Or better, figure out a way for the system to fetch these
>>>> automatically.
>>>>
>>>> - Evan
>>>>
>>>> On Thu, Feb 12, 2015 at 4:18 PM, Ulanov, Alexander <alexander.ula...@hp.com> wrote:
>>>>
>>>>> Just to summarize this thread, I was finally able to make all the
>>>>> performance comparisons that we discussed. It turns out that:
>>>>>
>>>>> BIDMat-cublas >> BIDMat MKL == netlib-mkl == netlib-openblas-compiled >
>>>>> netlib-openblas-yum-repo == netlib-cublas > netlib-blas > f2jblas
>>>>>
>>>>> Below is the link to the spreadsheet with the full results.
>>>>>
>>>>> https://docs.google.com/spreadsheets/d/1lWdVSuSragOobb0A_oeouQgHUMx378T9J5r7kwKSPkY/edit?usp=sharing
>>>>>
>>>>> One thing still needs exploration: does BIDMat-cublas perform copying
>>>>> to/from the machine's RAM?
>>>>>
>>>>> -----Original Message-----
>>>>> From: Ulanov, Alexander
>>>>> Sent: Tuesday, February 10, 2015 2:12 PM
>>>>> To: Evan R. Sparks
>>>>> Cc: Joseph Bradley; dev@spark.apache.org
>>>>> Subject: RE: Using CUDA within Spark / boosting linear algebra
>>>>>
>>>>> Thanks, Evan! It seems that ticket was marked as a duplicate, though
>>>>> the original one discusses a slightly different topic. I was able to
>>>>> link netlib with the MKL from the BIDMat binaries. Indeed, MKL is
>>>>> statically linked inside a 60MB library.
>>>>>
>>>>> Time in seconds:
>>>>>
>>>>> |A*B size                | BIDMat MKL  | Breeze+Netlib-MKL from BIDMat | Breeze+Netlib-OpenBlas (native system) | Breeze+Netlib-f2jblas |
>>>>> +------------------------+-------------+-------------------------------+----------------------------------------+-----------------------+
>>>>> |100x100*100x100         | 0.00205596  | 0.000381                      | 0.03810324                             | 0.002556              |
>>>>> |1000x1000*1000x1000     | 0.018320947 | 0.038316857                   | 0.51803557                             | 1.638475459           |
>>>>> |10000x10000*10000x10000 | 23.78046632 | 32.94546697                   | 445.0935211                            | 1569.233228           |
>>>>>
>>>>> It turns out that the precompiled MKL is faster than the precompiled
>>>>> OpenBlas on my machine. I'll probably add two more columns with locally
>>>>> compiled openblas and cuda.
>>>>>
>>>>> Alexander
>>>>>
>>>>> From: Evan R. Sparks [mailto:evan.spa...@gmail.com]
>>>>> Sent: Monday, February 09, 2015 6:06 PM
>>>>> To: Ulanov, Alexander
>>>>> Cc: Joseph Bradley; dev@spark.apache.org
>>>>> Subject: Re: Using CUDA within Spark / boosting linear algebra
>>>>>
>>>>> Great - perhaps we can move this discussion off-list and onto a JIRA
>>>>> ticket? (Here's one: https://issues.apache.org/jira/browse/SPARK-5705)
>>>>>
>>>>> It seems like this is going to be somewhat exploratory for a while (and
>>>>> there's probably only a handful of us who really care about fast linear
>>>>> algebra!)
>>>>>
>>>>> - Evan
>>>>>
>>>>> On Mon, Feb 9, 2015 at 4:48 PM, Ulanov, Alexander <alexander.ula...@hp.com> wrote:
>>>>> Hi Evan,
>>>>>
>>>>> Thank you for the explanation and the useful link. I am going to build
>>>>> OpenBLAS, link it with Netlib-java, and run the benchmark again.
>>>>>
>>>>> Do I understand correctly that the BIDMat binaries contain a statically
>>>>> linked Intel MKL BLAS? That might be the reason why I am able to run
>>>>> BIDMat without having MKL BLAS installed on my server. If it is true, I
>>>>> wonder if it is OK, because Intel sells this library. Nevertheless, it
>>>>> seems that in my case the precompiled MKL BLAS performs better than the
>>>>> precompiled OpenBLAS, given that BIDMat and Netlib-java are supposed to
>>>>> be on par in JNI overheads.
>>>>>
>>>>> Still, it might be interesting to link Netlib-java with Intel MKL, as
>>>>> you suggested. I wonder whether John Canny (BIDMat) and Sam Halliday
>>>>> (Netlib-java) would be interested in comparing their libraries.
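>>>>>
>>>>> For concreteness, this is the kind of minimal Breeze timing harness I
>>>>> mean for these benchmarks (a sketch: the sizes and the single warm-up
>>>>> pass are illustrative, and Breeze will use whichever BLAS netlib-java
>>>>> picks up at runtime):
>>>>>
>>>>>   import breeze.linalg.DenseMatrix
>>>>>
>>>>>   object GemmBench {
>>>>>     def time[A](body: => A): Double = {
>>>>>       val t0 = System.nanoTime(); body; (System.nanoTime() - t0) / 1e9
>>>>>     }
>>>>>     def main(args: Array[String]): Unit = {
>>>>>       for (n <- Seq(100, 1000, 10000)) {
>>>>>         val a = DenseMatrix.rand(n, n)
>>>>>         val b = DenseMatrix.rand(n, n)
>>>>>         time(a * b) // warm-up: loads the native library, lets JIT kick in
>>>>>         println(s"${n}x$n * ${n}x$n: ${time(a * b)} s")
>>>>>       }
>>>>>     }
>>>>>   }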
>>>>>
>>>>> Best regards, Alexander
>>>>>
>>>>> From: Evan R. Sparks [mailto:evan.spa...@gmail.com]
>>>>> Sent: Friday, February 06, 2015 5:58 PM
>>>>> To: Ulanov, Alexander
>>>>> Cc: Joseph Bradley; dev@spark.apache.org
>>>>> Subject: Re: Using CUDA within Spark / boosting linear algebra
>>>>>
>>>>> I would build OpenBLAS yourself, since good BLAS performance comes from
>>>>> getting cache sizes, etc. set up correctly for your particular hardware
>>>>> - this is often a very tricky process (see, e.g., ATLAS) - but we found
>>>>> that on relatively modern Xeon chips, OpenBLAS builds quickly and
>>>>> yields performance competitive with MKL.
>>>>>
>>>>> To make sure the right library is getting used, you have to make sure
>>>>> its directory is first on the search path - export
>>>>> LD_LIBRARY_PATH=/path/to/blas will do the trick here.
>>>>>
>>>>> For some examples of getting netlib-java set up on an EC2 node, and
>>>>> some example benchmarking code we ran a while back, see:
>>>>> https://github.com/shivaram/matrix-bench
>>>>>
>>>>> In particular, build-openblas-ec2.sh shows you how to build the library
>>>>> and set up the symlinks correctly, and scala/run-netlib.sh shows you
>>>>> how to get the path set up and have that library picked up by
>>>>> netlib-java.
>>>>>
>>>>> In this way you could probably get cuBLAS set up to be used by
>>>>> netlib-java as well.
>>>>>
>>>>> - Evan
>>>>>
>>>>> On Fri, Feb 6, 2015 at 5:43 PM, Ulanov, Alexander <alexander.ula...@hp.com> wrote:
>>>>> Evan, could you elaborate on how to force BIDMat and netlib-java to
>>>>> load the right blas? For netlib there are a few JVM flags, such as
>>>>> -Dcom.github.fommil.netlib.BLAS=com.github.fommil.netlib.F2jBLAS, so I
>>>>> can force it to use the Java implementation. I am not sure I understand
>>>>> how to force the use of a specific blas (as opposed to a specific
>>>>> wrapper for blas).
>>>>>
>>>>> Btw, I have installed openblas (yum install openblas), so I suppose
>>>>> that netlib is using it.
>>>>>
>>>>> From: Evan R. Sparks [mailto:evan.spa...@gmail.com]
>>>>> Sent: Friday, February 06, 2015 5:19 PM
>>>>> To: Ulanov, Alexander
>>>>> Cc: Joseph Bradley; dev@spark.apache.org
>>>>> Subject: Re: Using CUDA within Spark / boosting linear algebra
>>>>>
>>>>> Getting breeze to pick up the right blas library is critical for
>>>>> performance. I recommend using OpenBLAS (or MKL, if you already have
>>>>> it). It might make sense to force BIDMat to use the same underlying
>>>>> BLAS library as well.
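>>>>>
>>>>> A quick sanity check of which implementation netlib-java actually
>>>>> loaded (a minimal sketch; the class names are the ones netlib-java
>>>>> ships, and which one you see depends on what it finds at runtime):
>>>>>
>>>>>   object WhichBlas {
>>>>>     def main(args: Array[String]): Unit = {
>>>>>       // Prints e.g. com.github.fommil.netlib.NativeSystemBLAS when a
>>>>>       // native library was picked up, or com.github.fommil.netlib.F2jBLAS
>>>>>       // when it fell back to the pure-Java implementation.
>>>>>       println(com.github.fommil.netlib.BLAS.getInstance().getClass.getName)
>>>>>     }
>>>>>   }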
>>>>>
>>>>> On Fri, Feb 6, 2015 at 4:42 PM, Ulanov, Alexander <alexander.ula...@hp.com> wrote:
>>>>> Hi Evan, Joseph,
>>>>>
>>>>> I did a few matrix multiplication tests, and BIDMat seems to be ~10x
>>>>> faster than netlib-java+breeze (sorry for the weird table formatting).
>>>>> Time in seconds:
>>>>>
>>>>> |A*B size                | BIDMat MKL  | Breeze+Netlib-java native_system_linux_x86-64 | Breeze+Netlib-java f2jblas |
>>>>> +------------------------+-------------+-----------------------------------------------+----------------------------+
>>>>> |100x100*100x100         | 0.00205596  | 0.03810324                                    | 0.002556                   |
>>>>> |1000x1000*1000x1000     | 0.018320947 | 0.51803557                                    | 1.638475459                |
>>>>> |10000x10000*10000x10000 | 23.78046632 | 445.0935211                                   | 1569.233228                |
>>>>>
>>>>> Configuration: Intel(R) Xeon(R) CPU E3-1240 3.3 GHz, 6GB RAM, Fedora 19
>>>>> Linux, Scala 2.11.
>>>>>
>>>>> Later I will run tests with Cuda; I need to install a newer Cuda
>>>>> version for this purpose.
>>>>>
>>>>> Do you have any idea why breeze-netlib with native blas is so much
>>>>> slower than BIDMat MKL?
>>>>>
>>>>> Best regards, Alexander
>>>>>
>>>>> From: Joseph Bradley [mailto:jos...@databricks.com]
>>>>> Sent: Thursday, February 05, 2015 5:29 PM
>>>>> To: Ulanov, Alexander
>>>>> Cc: Evan R. Sparks; dev@spark.apache.org
>>>>> Subject: Re: Using CUDA within Spark / boosting linear algebra
>>>>>
>>>>> Hi Alexander,
>>>>>
>>>>> Using GPUs with Spark would be very exciting. Small comment: concerning
>>>>> your earlier question about keeping data stored on the GPU rather than
>>>>> having to move it between main memory and GPU memory on each iteration,
>>>>> I would guess this is critical to getting good performance. If you
>>>>> could do multiple local iterations before aggregating results, then the
>>>>> cost of data movement to the GPU could be amortized (and I believe that
>>>>> is done in practice). Having Spark be aware of the GPU and use it as
>>>>> another part of memory sounds like a much bigger undertaking.
>>>>>
>>>>> Joseph
>>>>>
>>>>> On Thu, Feb 5, 2015 at 4:59 PM, Ulanov, Alexander <alexander.ula...@hp.com> wrote:
>>>>> Thank you for the explanation! I've watched the BIDMach presentation by
>>>>> John Canny and I am really inspired by his talk and his comparisons
>>>>> with Spark MLlib.
>>>>>
>>>>> I am very interested to find out what will work better within Spark:
>>>>> BIDMat, or netlib-java with CPU or GPU natives. Could you suggest a
>>>>> fair way to benchmark them? Currently I do benchmarks on artificial
>>>>> neural networks in batch mode. While that is not a "pure" test of
>>>>> linear algebra, it involves some other things that are essential to
>>>>> machine learning.
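>>>>>
>>>>> For the GPU side, a fair measurement would time the host-to-device and
>>>>> device-to-host copies together with the multiply. A minimal BIDMat-style
>>>>> sketch, assuming the BIDMat shell where these functions are in scope
>>>>> (flip/flop are BIDMat's timer calls, GMat is its GPU matrix type):
>>>>>
>>>>>   val n = 1000
>>>>>   val f = rand(n, n)            // CPU-side single-precision matrix (FMat)
>>>>>   flip                          // start BIDMat's timer
>>>>>   val g = GMat(n, n)            // allocate on the GPU
>>>>>   g.copyFrom(f)                 // host -> device copy, inside the timing
>>>>>   val c = (g * g).toFMat(null)  // multiply on the GPU, copy result back
>>>>>   val r = flop                  // timer result, round-trip copies included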
>>>>>
>>>>> From: Evan R. Sparks [mailto:evan.spa...@gmail.com]
>>>>> Sent: Thursday, February 05, 2015 1:29 PM
>>>>> To: Ulanov, Alexander
>>>>> Cc: dev@spark.apache.org
>>>>> Subject: Re: Using CUDA within Spark / boosting linear algebra
>>>>>
>>>>> I'd be surprised if BIDMat+OpenBLAS were significantly faster than
>>>>> netlib-java+OpenBLAS, but if it is much faster, that is probably due to
>>>>> data layout and fewer levels of indirection - it's definitely a
>>>>> worthwhile experiment to run. The main speedups I've seen from using it
>>>>> come from its highly optimized GPU code for linear algebra. I know that
>>>>> in the past Canny has gone as far as writing custom GPU kernels for
>>>>> performance-critical regions of code. [1]
>>>>>
>>>>> BIDMach is highly optimized for single-node performance or performance
>>>>> on small clusters. [2] Once data doesn't fit easily in GPU memory (or
>>>>> cannot be batched that way), the performance tends to fall off. Canny
>>>>> argues for hardware/software codesign, and as such prefers machine
>>>>> configurations that are quite different from what we find in most
>>>>> commodity cluster nodes - e.g. 10 disk channels and 4 GPUs.
>>>>>
>>>>> In contrast, MLlib was designed for horizontal scalability on commodity
>>>>> clusters and works best on very big datasets - on the order of
>>>>> terabytes.
>>>>>
>>>>> For the most part, these projects were developed concurrently to
>>>>> address slightly different use cases. That said, there may be bits of
>>>>> BIDMach we could repurpose for MLlib - keep in mind we need to be
>>>>> careful about maintaining cross-language compatibility for our Java and
>>>>> Python users, though.
>>>>>
>>>>> - Evan
>>>>>
>>>>> [1] - http://arxiv.org/abs/1409.5402
>>>>> [2] - http://eecs.berkeley.edu/~hzhao/papers/BD.pdf
>>>>>
>>>>> On Thu, Feb 5, 2015 at 1:00 PM, Ulanov, Alexander <alexander.ula...@hp.com> wrote:
>>>>> Hi Evan,
>>>>>
>>>>> Thank you for the suggestion! BIDMat seems to have terrific speed. Do
>>>>> you know what makes it faster than netlib-java?
>>>>>
>>>>> The same group has the BIDMach library that implements machine
>>>>> learning. For some examples they use the Caffe convolutional neural
>>>>> network library, owned by another group in Berkeley. Could you
>>>>> elaborate on how these all might be connected with Spark MLlib? If you
>>>>> take BIDMat for linear algebra, why not take BIDMach for optimization
>>>>> and learning?
>>>>>
>>>>> Best regards, Alexander
>>>>>
>>>>> From: Evan R. Sparks [mailto:evan.spa...@gmail.com]
>>>>> Sent: Thursday, February 05, 2015 12:09 PM
>>>>> To: Ulanov, Alexander
>>>>> Cc: dev@spark.apache.org
>>>>> Subject: Re: Using CUDA within Spark / boosting linear algebra
>>>>>
>>>>> I'd expect that we can make GPU-accelerated BLAS faster than CPU BLAS
>>>>> in many cases.
>>>>>
>>>>> You might consider taking a look at the codepaths that BIDMat
>>>>> (https://github.com/BIDData/BIDMat) takes and comparing them to
>>>>> netlib-java/breeze. John Canny et al. have done a bunch of optimization
>>>>> work to make this really fast from Scala. I've run it on my laptop,
>>>>> compared to MKL, and in certain cases it's 10x faster at matrix
>>>>> multiply. There are a lot of layers of indirection here, and you really
>>>>> want to avoid data copying as much as possible.
>>>>>
>>>>> We could also consider swapping out Breeze for BIDMat, but that would
>>>>> be a big project, and if we can figure out how to get breeze+cublas to
>>>>> comparable performance, that would be a big win.
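>>>>>
>>>>> To see what those layers cost, you can also call the netlib-java BLAS
>>>>> interface directly, bypassing Breeze entirely (a minimal sketch; dgemm
>>>>> computes C := alpha*A*B + beta*C over column-major arrays):
>>>>>
>>>>>   import com.github.fommil.netlib.BLAS
>>>>>
>>>>>   object RawDgemm {
>>>>>     def main(args: Array[String]): Unit = {
>>>>>       val n = 1000
>>>>>       val a = Array.fill(n * n)(math.random)  // column-major n x n
>>>>>       val b = Array.fill(n * n)(math.random)
>>>>>       val c = new Array[Double](n * n)
>>>>>       // "N", "N": no transpose; then m, n, k dimensions; n is also the
>>>>>       // leading dimension of each array.
>>>>>       BLAS.getInstance.dgemm("N", "N", n, n, n, 1.0, a, n, b, n, 0.0, c, n)
>>>>>     }
>>>>>   }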
>>>>>
>>>>> On Thu, Feb 5, 2015 at 11:55 AM, Ulanov, Alexander <alexander.ula...@hp.com> wrote:
>>>>> Dear Spark developers,
>>>>>
>>>>> I am exploring how to make linear algebra operations faster within
>>>>> Spark. One way of doing this is to use the Scala Breeze library that is
>>>>> bundled with Spark. For matrix operations, it employs Netlib-java,
>>>>> which has a Java wrapper for BLAS (basic linear algebra subprograms)
>>>>> and LAPACK native binaries, if they are available on the worker node.
>>>>> It also has its own optimized Java implementation of BLAS. It is worth
>>>>> mentioning that the native binaries provide better performance only for
>>>>> BLAS level 3, i.e. matrix-matrix operations, or general matrix
>>>>> multiplication (GEMM). This is confirmed by the GEMM test on the
>>>>> Netlib-java page https://github.com/fommil/netlib-java. I also
>>>>> confirmed it in my experiments with training an artificial neural
>>>>> network https://github.com/apache/spark/pull/1290#issuecomment-70313952.
>>>>> However, I would like to boost performance further.
>>>>>
>>>>> GPUs are supposed to be fast at linear algebra, and there is an Nvidia
>>>>> CUDA implementation of BLAS, called cublas. I have one Linux server
>>>>> with an Nvidia GPU, and I was able to do the following. I linked cublas
>>>>> (instead of a CPU-based blas) with the Netlib-java wrapper and put it
>>>>> into Spark, so Breeze/Netlib is using it. Then I did some performance
>>>>> measurements of artificial neural network batch learning in Spark
>>>>> MLlib, which involves matrix-matrix multiplications. It turns out that
>>>>> for matrices of size less than ~1000x780, GPU cublas has the same speed
>>>>> as CPU blas, and cublas becomes slower for bigger matrices. It is worth
>>>>> mentioning that this was not a test of ONLY multiplication, since other
>>>>> operations were involved as well. One of the reasons for the slowdown
>>>>> might be the overhead of copying the matrices from main memory to
>>>>> graphics card memory and back.
>>>>>
>>>>> So, a few questions:
>>>>> 1) Do these results with CUDA make sense?
>>>>> 2) If the problem is the copy overhead, are there any libraries that
>>>>> allow forcing intermediate results to stay in graphics card memory,
>>>>> thus removing the overhead?
>>>>> 3) Any other options to speed up linear algebra in Spark?
>>>>>
>>>>> Thank you, Alexander
>
> --
> Best regards,
> Sam

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org