I guess the key is how you initialize BLAS for Breeze. I do not want to
open a Breeze topic here, but since the core operation is a dense GEMM,
it all comes down to which BLAS library actually ends up being used by
Breeze (via netlib, I guess; there are tons of choices there) and the
cost of the JNI calls themselves in Java.
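
If you want to check which implementation you actually ended up with,
something along these lines should print it (a minimal sketch, assuming
Breeze delegates to netlib-java's com.github.fommil.netlib classes, which
recent versions do):

  import com.github.fommil.netlib.BLAS

  object WhichBlas extends App {
    // netlib-java resolves NativeSystemBLAS -> NativeRefBLAS -> F2jBLAS,
    // in that order; this prints whichever one actually got loaded
    println(BLAS.getInstance().getClass.getName)
  }

If it prints F2jBLAS, you are on the pure-Java fallback and no native
kernels are involved at all.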

I am pretty sure Matlab is using statically linked, CPU-specific libraries
from the CPU vendors (MKL and AMD). MKL would beat anything in open source
right now, and although it is possible to use MKL with netlib, I am pretty
sure it is not the default option, since MKL is not an open source product.
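
That said, netlib-java can be pointed at the system BLAS explicitly, so if
the machine's libblas happens to be MKL (or OpenBLAS/ATLAS), Breeze should
pick it up through JNI. A rough sketch, with the property name as I recall
it from the netlib-java README, so double-check:

  // must be set before the first BLAS.getInstance() call, or passed as a
  // JVM flag:
  //   -Dcom.github.fommil.netlib.BLAS=com.github.fommil.netlib.NativeSystemBLAS
  System.setProperty("com.github.fommil.netlib.BLAS",
    "com.github.fommil.netlib.NativeSystemBLAS")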

Without those details it is impossible to say why you see what you see,
but I can assure you it is all about the speed of the native CPU-accelerated
kernels and the cost of JNI in the actual algorithms.
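
For reference, here is roughly what the computation reduces to; a minimal
Breeze sketch (my own illustration, not the linked sqDist.scala) that pushes
all of the heavy work into a single x * x.t, i.e. one dgemm call into
whatever BLAS got loaded:

  import breeze.linalg.{DenseMatrix, DenseVector, diag}

  /** Pairwise squared Euclidean distances between the rows of x (n x d). */
  def sqDistViaGemm(x: DenseMatrix[Double]): DenseMatrix[Double] = {
    val n    = x.rows
    val g    = x * x.t                 // n x n Gram matrix: the one GEMM call
    val sq   = diag(g)                 // squared row norms ||x_i||^2
    val ones = DenseVector.ones[Double](n)
    // d(i, j) = ||x_i||^2 + ||x_j||^2 - 2 * <x_i, x_j>
    // (a real implementation would broadcast instead of materializing the
    //  two rank-1 outer products, but this keeps the sketch obvious)
    sq * ones.t + ones * sq.t - g * 2.0
  }

Everything outside the x * x.t line is O(n^2) element-wise bookkeeping, so
the gap you measure against gpml's sq_dist is essentially the BLAS backend
plus the JNI crossing cost.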



On Wed, Sep 9, 2015 at 11:41 PM, Daniel Korzekwa <daniel.korze...@gmail.com>
wrote:

> I already use Breeze; actually, my current impl of sqDist uses it:
>
> https://github.com/danielkorzekwa/bayes-scala-gp/blob/master/src/main/scala/dk/gp/math/sqDist.scala
>
> still 3 times slower than sq_dist from gpml
>
> thanks for BID Data Project info
>
> On 9 September 2015 at 18:45, Dmitriy Lyubimov <dlie...@gmail.com> wrote:
>
> > Hi Daniel,
> >
> > You mean, for dense algebra, single-threaded Java vs. cache-optimized,
> > multithreaded, SSE4-optimized Intel MKL? I am actually surprised it is
> > not at least 10x.
> >
> > Mahout focuses on ease of distributed implementations (i.e., a dsq_dist
> > variant of the routine) but has been somewhat lazy about marrying
> > mahout-math with hardware-optimized in-core libraries. That much is true.
> >
> > The things that somewhat downplayed the priority of in-core, CPU-bound
> > algebra optimizations were:
> >
> > (1) For distributed operations, multithreading plays a significantly
> > smaller role (well-behaved tasks should assume they are allocated only 1
> > core and rely on the resource manager to allocate CPU resources).
> > (2) For distributed algorithms, unless they are naive power-law ports of
> > in-core algorithms, I/O and data serialization expenses start to play a
> > significant role in overall algorithm performance compared to
> > shared-memory single-machine algorithms.
> > (3) A lot of algorithms require non-BLAS kernel operators anyway.
> > (4) Most importantly, standard BLAS is somewhat unsatisfactory in the
> > sparse algebra department; I would seek a better solution than just the
> > BLAS API. There are some emerging technologies that are sparse/dense
> > balanced libraries, but the jury is still out as to what the best pathway
> > here is. Or maybe the best path is to do what Theano and BidMat did, i.e.
> > develop a new set of algebraic kernel routines, but that's probably too
> > heavy for this project at the moment.
> >
> > If you need a good CPU-bound, shared-memory environment for dense
> > algebra, I'd suggest trying either Breeze or BidMat. Perhaps even the
> > latter, as it does support sparse subroutines, somewhat anyway, and also
> > has a GPU-enabled set of matrix implementations.
> >
> > On Wed, Sep 9, 2015 at 12:21 AM, Daniel Korzekwa <
> > daniel.korze...@gmail.com>
> > wrote:
> >
> > > Hello,
> > >
> > > I'm comparing the efficiency of sq_dist() from Mahout to sq_dist()
> > > from the gpml library, which is based on bsxfun in Octave/Matlab.
> > >
> > > It seems that computing the distance matrix in Octave is 5 times
> > > faster than in Mahout. Why is that? Can we make it faster?
> > >
> > > Octave:
> > >  x = [1:4000]
> > >  sq_dist(x)
> > >
> > > Scala (Mahout):
> > >   val x = Array.range(1, 4001).map(i => i.toDouble)
> > >   val A =  new DenseMatrix(Array(x)).transpose()
> > >   val dM = sqDist(A)
> > >
> > > --
> > > Daniel Korzekwa
> > > Machine Learning Engineer
> > > https://www.linkedin.com/in/danielkorzekwa <http://danmachine.com/>
> > >
> >
>
>
>
> --
> Daniel Korzekwa
> Machine Learning Engineer
> https://www.linkedin.com/in/danielkorzekwa <http://danmachine.com/>
>
