Re: trouble with sbt building network-* projects?

2015-02-27 Thread Ted Yu
bq. I have to keep cd'ing into network/common, run mvn install, then go
back to network/shuffle and run some other mvn command over there.

Yeah - been through this.

Having continuous testing for maven would be nice.
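
Not continuous, but one thing that cuts down on the cd'ing is letting the
maven reactor build the module and its dependencies in one go (roughly --
the module paths are the ones from the root pom):

  mvn -pl network/shuffle -am test

That compiles and tests network/shuffle plus the modules it depends on
(network/common) without touching sql, graphx, mllib etc.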

On Fri, Feb 27, 2015 at 11:31 AM, Imran Rashid  wrote:

> well, perhaps I just need to learn to use maven better, but currently I
> find sbt much more convenient for continuously running my tests.  I do use
> zinc, but I'm looking for continuous testing.  This makes me think I need
> sbt for that:
> http://stackoverflow.com/questions/11347633/is-there-a-java-continuous-testing-plugin-for-maven
>
> 1) I really like that in sbt I can run "~test-only
> com.foo.bar.SomeTestSuite" (or whatever other pattern) and just leave that
> running as I code, without having to go and explicitly trigger "mvn test"
> and wait for the result.
>
> 2) I find sbt's handling of sub-projects much simpler (when it works).
> I'm trying to make changes to network/common & network/shuffle, which means
> I have to keep cd'ing into network/common, run mvn install, then go back to
> network/shuffle and run some other mvn command over there.  I don't want to
> run mvn at the root project level, b/c I don't want to wait for it to
> compile all the other projects when I just want to run tests in
> network/common.  Even with incremental compiling, in my day-to-day coding I
> want to entirely skip compiling sql, graphx, mllib etc. -- I have to switch
> branches often enough that I end up triggering a full rebuild of those
> projects even when I haven't touched them.
>
>
>
>
>
> On Fri, Feb 27, 2015 at 1:14 PM, Ted Yu  wrote:
>
>> bq. to be able to run my tests in sbt, though, it makes the development
>> iterations much faster.
>>
>> Was the preference for sbt due to long maven build time?
>> Have you started Zinc on your machine?
>>
>> Cheers
>>
>> On Fri, Feb 27, 2015 at 11:10 AM, Imran Rashid 
>> wrote:
>>
>>> Has anyone else noticed very strange build behavior in the network-*
>>> projects?
>>>
>>> maven seems to be doing the right thing, but sbt is very inconsistent.
>>> Sometimes when it builds network-shuffle it doesn't know about any of the
>>> code in network-common.  Sometimes it will completely skip the java unit
>>> tests.  And then some time later, it'll suddenly decide it knows about
>>> some
>>> more of the java unit tests.  It's not from a simple change, like
>>> touching a
>>> test file, or a file the test depends on -- nor a restart of sbt.  I am
>>> pretty confused.
>>>
>>>
>>> maven had issues when I tried to add scala code to network-common, it
>>> would
>>> compile the scala code but not make it available to java.  I'm working
>>> around that by just coding in java anyhow.  I'd really like to be able to
>>> run my tests in sbt, though, it makes the development iterations much
>>> faster.
>>>
>>> thanks,
>>> Imran
>>>
>>
>>
>


Re: Using CUDA within Spark / boosting linear algebra

2015-02-27 Thread Sam Halliday
Also, check the JNILoader output.

Remember, for netlib-java to use your system libblas all you need to do is
setup libblas.so.3 like any native application would expect.

I haven't ever used the cublas "real BLAS"  implementation, so I'd be
interested to hear about this. Do an 'ldd /usr/lib/libblas.so.3' to check
that all the runtime links are in order.
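
If in doubt, a tiny probe like the following prints which implementation got
loaded (just a sketch -- it assumes netlib-java 1.1.x is on the classpath,
e.g. via the com.github.fommil.netlib:all artifact):

  import com.github.fommil.netlib.BLAS

  object BlasProbe {
    def main(args: Array[String]): Unit = {
      val blas = BLAS.getInstance()
      // NativeSystemBLAS means your libblas.so.3 was picked up;
      // F2jBLAS means it fell back to the pure-Java implementation
      println(blas.getClass.getName)
      // trivial 2x2 DGEMM (column-major) so the native path actually gets hit
      val a = Array(1.0, 2.0, 3.0, 4.0)
      val b = Array(5.0, 6.0, 7.0, 8.0)
      val c = new Array[Double](4)
      blas.dgemm("N", "N", 2, 2, 2, 1.0, a, 2, b, 2, 0.0, c, 2)
      println(c.mkString(", "))
    }
  }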

Btw, I have some DGEMM wrappers in my netlib-java performance module... and
I also planned to write more in MultiBLAS (until I mothballed the project
for the hardware to catch up, which it probably has, and now I just need a
reason to look at it).
 On 27 Feb 2015 20:26, "Xiangrui Meng"  wrote:

> Hey Sam,
>
> The running times are not "big O" estimates:
>
> > The CPU version finished in 12 seconds.
> > The CPU->GPU->CPU version finished in 2.2 seconds.
> > The GPU version finished in 1.7 seconds.
>
> I think there is something wrong with the netlib/cublas combination.
> Sam already mentioned that cuBLAS doesn't implement the CPU BLAS
> interfaces. I checked the CUDA doc and it seems that to use GPU BLAS
> through the CPU BLAS interface we need to use NVBLAS, which intercepts
> some Level 3 CPU BLAS calls (including GEMM). So we need to load
> nvblas.so first and then some CPU BLAS library in JNI. I wonder
> whether the setup was correct.
>
> Alexander, could you check whether GPU is used in the netlib-cublas
> experiments? You can tell it by watching CPU/GPU usage.
>
> Best,
> Xiangrui
>
> On Thu, Feb 26, 2015 at 10:47 PM, Sam Halliday 
> wrote:
> > Don't use "big O" estimates, always measure. It used to work back in the
> > days when double multiplication was a bottleneck. The computation cost is
> > effectively free on both the CPU and GPU and you're seeing pure copying
> > costs. Also, I'm dubious that cublas is doing what you think it is. Can
> you
> > link me to the source code for DGEMM?
> >
> > I show all of this in my talk, with explanations, I can't stress enough
> how
> > much I recommend that you watch it if you want to understand high
> > performance hardware acceleration for linear algebra :-)
> >
> > On 27 Feb 2015 01:42, "Xiangrui Meng"  wrote:
> >>
> >> The copying overhead should be quadratic on n, while the computation
> >> cost is cubic on n. I can understand that netlib-cublas is slower than
> >> netlib-openblas on small problems. But I'm surprised to see that it is
> >> still 20x slower on 1x1. I did the following on a g2.2xlarge
> >> instance with BIDMat:
> >>
> >> val n = 1
> >>
> >> val f = rand(n, n)
> >> flip; f*f; val rf = flop
> >>
> >> flip; val g = GMat(n, n); g.copyFrom(f); (g*g).toFMat(null); val rg =
> flop
> >>
> >> flip; g*g; val rgg = flop
> >>
> >> The CPU version finished in 12 seconds.
> >> The CPU->GPU->CPU version finished in 2.2 seconds.
> >> The GPU version finished in 1.7 seconds.
> >>
> >> I'm not sure whether my CPU->GPU->CPU code simulates the netlib-cublas
> >> path. But based on the result, the data copying overhead is definitely
> >> not as big as 20x at n = 1.
> >>
> >> Best,
> >> Xiangrui
> >>
> >>
> >> On Thu, Feb 26, 2015 at 2:21 PM, Sam Halliday 
> >> wrote:
> >> > I've had some email exchanges with the author of BIDMat: it does
> exactly
> >> > what you need to get the GPU benefit and writes higher level
> algorithms
> >> > entirely in the GPU kernels so that the memory stays there as long as
> >> > possible. The restriction with this approach is that it is only
> offering
> >> > high-level algorithms so is not a toolkit for applied mathematics
> >> > research and development --- but it works well as a toolkit for higher
> >> > level analysis (e.g. for analysts and practitioners).
> >> >
> >> > I believe BIDMat's approach is the best way to get performance out of
> >> > GPU hardware at the moment but I also have strong evidence to suggest
> >> > that the hardware will catch up and the memory transfer costs between
> >> > CPU/GPU will disappear meaning that there will be no need for custom
> GPU
> >> > kernel implementations. i.e. please continue to use BLAS primitives
> when
> >> > writing new algorithms and only go to the GPU for an alternative
> >> > optimised implementation.
> >> >
> >> > Note that CUDA and cuBLAS are *not* BLAS. They are BLAS-like, and
> offer
> >> > an API that looks like BLAS but takes pointers to special regions in
> the
> >> > GPU memory region. Somebody has written a wrapper around CUDA to
> create
> >> > a proper BLAS library but it only gives marginal performance over the
> >> > CPU because of the memory transfer overhead.
> >> >
> >> > This slide from my talk
> >> >
> >> >   http://fommil.github.io/scalax14/#/11/2
> >> >
> >> > says it all. X axis is matrix size, Y axis is logarithmic time to do
> >> > DGEMM. Black line is the "cheating" time for the GPU and the green
> line
> >> > is after copying the memory to/from the GPU memory. APUs have the
> >> > potential to eliminate the green line.
> >> >
> >> > Best regards,
> >> > Sam
> >> >
> >> >
> >> >
> >> > 

Re: Using CUDA within Spark / boosting linear algebra

2015-02-27 Thread Xiangrui Meng
Hey Sam,

The running times are not "big O" estimates:

> The CPU version finished in 12 seconds.
> The CPU->GPU->CPU version finished in 2.2 seconds.
> The GPU version finished in 1.7 seconds.

I think there is something wrong with the netlib/cublas combination.
Sam already mentioned that cuBLAS doesn't implement the CPU BLAS
interfaces. I checked the CUDA doc and it seems that to use GPU BLAS
through the CPU BLAS interface we need to use NVBLAS, which intercepts
some Level 3 CPU BLAS calls (including GEMM). So we need to load
nvblas.so first and then some CPU BLAS library in JNI. I wonder
whether the setup was correct.
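
For reference, my reading of the doc is that the setup would look roughly
like this (just a sketch -- the paths and the choice of CPU BLAS are only
examples):

  # nvblas.conf (read from the working dir, or pointed to by NVBLAS_CONFIG_FILE)
  NVBLAS_CPU_BLAS_LIB /usr/lib/libopenblas.so
  NVBLAS_GPU_LIST ALL
  NVBLAS_TRACE_LOG_ENABLED

  # preload NVBLAS so its GEMM intercepts the CPU BLAS symbols
  LD_PRELOAD=/usr/local/cuda/lib64/libnvblas.so java -cp ... <benchmark>

With the trace log enabled, the intercepted Level 3 calls should show up in
the NVBLAS log, which would be another way to confirm whether the GPU path
is actually taken.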

Alexander, could you check whether GPU is used in the netlib-cublas
experiments? You can tell it by watching CPU/GPU usage.

Best,
Xiangrui

On Thu, Feb 26, 2015 at 10:47 PM, Sam Halliday  wrote:
> Don't use "big O" estimates, always measure. It used to work back in the
> days when double multiplication was a bottleneck. The computation cost is
> effectively free on both the CPU and GPU and you're seeing pure copying
> costs. Also, I'm dubious that cublas is doing what you think it is. Can you
> link me to the source code for DGEMM?
>
> I show all of this in my talk, with explanations, I can't stress enough how
> much I recommend that you watch it if you want to understand high
> performance hardware acceleration for linear algebra :-)
>
> On 27 Feb 2015 01:42, "Xiangrui Meng"  wrote:
>>
>> The copying overhead should be quadratic on n, while the computation
>> cost is cubic on n. I can understand that netlib-cublas is slower than
>> netlib-openblas on small problems. But I'm surprised to see that it is
>> still 20x slower on 1x1. I did the following on a g2.2xlarge
>> instance with BIDMat:
>>
>> val n = 1
>>
>> val f = rand(n, n)
>> flip; f*f; val rf = flop
>>
>> flip; val g = GMat(n, n); g.copyFrom(f); (g*g).toFMat(null); val rg = flop
>>
>> flip; g*g; val rgg = flop
>>
>> The CPU version finished in 12 seconds.
>> The CPU->GPU->CPU version finished in 2.2 seconds.
>> The GPU version finished in 1.7 seconds.
>>
>> I'm not sure whether my CPU->GPU->CPU code simulates the netlib-cublas
>> path. But based on the result, the data copying overhead is definitely
>> not as big as 20x at n = 1.
>>
>> Best,
>> Xiangrui
>>
>>
>> On Thu, Feb 26, 2015 at 2:21 PM, Sam Halliday 
>> wrote:
>> > I've had some email exchanges with the author of BIDMat: it does exactly
>> > what you need to get the GPU benefit and writes higher level algorithms
>> > entirely in the GPU kernels so that the memory stays there as long as
>> > possible. The restriction with this approach is that it is only offering
>> > high-level algorithms so is not a toolkit for applied mathematics
>> > research and development --- but it works well as a toolkit for higher
>> > level analysis (e.g. for analysts and practitioners).
>> >
>> > I believe BIDMat's approach is the best way to get performance out of
>> > GPU hardware at the moment but I also have strong evidence to suggest
>> > that the hardware will catch up and the memory transfer costs between
>> > CPU/GPU will disappear meaning that there will be no need for custom GPU
>> > kernel implementations. i.e. please continue to use BLAS primitives when
>> > writing new algorithms and only go to the GPU for an alternative
>> > optimised implementation.
>> >
>> > Note that CUDA and cuBLAS are *not* BLAS. They are BLAS-like, and offer
>> > an API that looks like BLAS but takes pointers to special regions in the
>> > GPU memory region. Somebody has written a wrapper around CUDA to create
>> > a proper BLAS library but it only gives marginal performance over the
>> > CPU because of the memory transfer overhead.
>> >
>> > This slide from my talk
>> >
>> >   http://fommil.github.io/scalax14/#/11/2
>> >
>> > says it all. X axis is matrix size, Y axis is logarithmic time to do
>> > DGEMM. Black line is the "cheating" time for the GPU and the green line
>> > is after copying the memory to/from the GPU memory. APUs have the
>> > potential to eliminate the green line.
>> >
>> > Best regards,
>> > Sam
>> >
>> >
>> >
>> > "Ulanov, Alexander"  writes:
>> >
>> >> Evan, thank you for the summary. I would like to add some more
>> >> observations. The GPU that I used is 2.5 times cheaper than the CPU
>> >> ($250 vs $100). They are both 3 years old. I also did a small test with
>> >> modern hardware, and the new GPU nVidia Titan was slightly more than 1
>> >> order of magnitude faster than Intel E5-2650 v2 for the same tests.
>> >> However, it costs as much as the CPU ($1200). My takeaway is that GPUs
>> >> are making better price/value progress.
>> >>
>> >>
>> >>
>> >> Xiangrui, I was also surprised that BIDMat-cuda was faster than
>> >> netlib-cuda and the most reasonable explanation is that it holds the
>> >> result in GPU memory, as Sam suggested. At the same time, it is OK
>> >> because you can copy the result back from GPU only when needed.
>> >> However, to 

Re: trouble with sbt building network-* projects?

2015-02-27 Thread Imran Rashid
well, perhaps I just need to learn to use maven better, but currently I
find sbt much more convenient for continuously running my tests.  I do use
zinc, but I'm looking for continuous testing.  This makes me think I need
sbt for that:
http://stackoverflow.com/questions/11347633/is-there-a-java-continuous-testing-plugin-for-maven

1) I really like that in sbt I can run "~test-only
com.foo.bar.SomeTestSuite" (or whatever other pattern) and just leave that
running as I code, without having to go and explicitly trigger "mvn test"
and wait for the result.

2) I find sbt's handling of sub-projects much simpler (when it works).  I'm
trying to make changes to network/common & network/shuffle, which means I
have to keep cd'ing into network/common, run mvn install, then go back to
network/shuffle and run some other mvn command over there.  I don't want to
run mvn at the root project level, b/c I don't want to wait for it to
compile all the other projects when I just want to run tests in
network/common.  Even with incremental compiling, in my day-to-day coding I
want to entirely skip compiling sql, graphx, mllib etc. -- I have to switch
branches often enough that I end up triggering a full rebuild of those
projects even when I haven't touched them.  (See the sketch just below.)
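
The loop I'm after, roughly -- the project id is a guess on my part (run
"projects" in the sbt shell to see the real ids), and the suite name is
just a placeholder:

  > project network-shuffle
  > ~test-only com.foo.bar.SomeTestSuite

Scoped that way, each save only recompiles network-shuffle and the modules
it depends on (network-common), and the matching tests re-run automatically.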





On Fri, Feb 27, 2015 at 1:14 PM, Ted Yu  wrote:

> bq. to be able to run my tests in sbt, though, it makes the development
> iterations much faster.
>
> Was the preference for sbt due to long maven build time?
> Have you started Zinc on your machine?
>
> Cheers
>
> On Fri, Feb 27, 2015 at 11:10 AM, Imran Rashid 
> wrote:
>
>> Has anyone else noticed very strange build behavior in the network-*
>> projects?
>>
>> maven seems to be doing the right thing, but sbt is very inconsistent.
>> Sometimes when it builds network-shuffle it doesn't know about any of the
>> code in network-common.  Sometimes it will completely skip the java unit
>> tests.  And then some time later, it'll suddenly decide it knows about
>> some
>> more of the java unit tests.  It's not from a simple change, like touching
>> a
>> test file, or a file the test depends on -- nor a restart of sbt.  I am
>> pretty confused.
>>
>>
>> maven had issues when I tried to add scala code to network-common, it
>> would
>> compile the scala code but not make it available to java.  I'm working
>> around that by just coding in java anyhow.  I'd really like to be able to
>> run my tests in sbt, though, it makes the development iterations much
>> faster.
>>
>> thanks,
>> Imran
>>
>
>


Re: trouble with sbt building network-* projects?

2015-02-27 Thread Ted Yu
bq. to be able to run my tests in sbt, though, it makes the development
iterations much faster.

Was the preference for sbt due to long maven build time?
Have you started Zinc on your machine?

Cheers

On Fri, Feb 27, 2015 at 11:10 AM, Imran Rashid  wrote:

> Has anyone else noticed very strange build behavior in the network-*
> projects?
>
> maven seems to be doing the right thing, but sbt is very inconsistent.
> Sometimes when it builds network-shuffle it doesn't know about any of the
> code in network-common.  Sometimes it will completely skip the java unit
> tests.  And then some time later, it'll suddenly decide it knows about some
> more of the java unit tests.  It's not from a simple change, like touching a
> test file, or a file the test depends on -- nor a restart of sbt.  I am
> pretty confused.
>
>
> maven had issues when I tried to add scala code to network-common, it would
> compile the scala code but not make it available to java.  I'm working
> around that by just coding in java anyhow.  I'd really like to be able to
> run my tests in sbt, though, it makes the development iterations much
> faster.
>
> thanks,
> Imran
>


trouble with sbt building network-* projects?

2015-02-27 Thread Imran Rashid
Has anyone else noticed very strange build behavior in the network-*
projects?

maven seems to be doing the right thing, but sbt is very inconsistent.
Sometimes when it builds network-shuffle it doesn't know about any of the
code in network-common.  Sometimes it will completely skip the java unit
tests.  And then some time later, it'll suddenly decide it knows about some
more of the java unit tests.  It's not from a simple change, like touching a
test file, or a file the test depends on -- nor a restart of sbt.  I am
pretty confused.


maven had issues when I tried to add scala code to network-common, it would
compile the scala code but not make it available to java.  I'm working
around that by just coding in java anyhow.  I'd really like to be able to
run my tests in sbt, though, it makes the development iterations much
faster.

thanks,
Imran